Guest Post: The Caselaw Access Project — Then, Now, Tomorrow

On my left: the edge of the off-ramp, a modest guardrail, and a fifty-foot drop. On my right, inching closer: a tractor-trailer determined to occupy my lane. I hit the brakes. The truck kept rolling. Its wheels pressed into my car as it wedged me against the curb and carved a tail-to-nose dent in my poor Toyota.

This was early 2015, on my commute to Cambridge, Mass., the morning of a critical meeting at Harvard Law School, where I worked. Harvard professor Jonathan Zittrain and I were sitting down with Daniel Lewis and Nik Reed, the founders of a legal research startup named Ravel Law, along with lawyers from Harvard’s Office of General Counsel, Debevoise & Plimpton and Gunderson Dettmer. We’d all been working for over a year on a contract that would make it possible, someday in the future, for everyone to have free and open access to all the official court decisions ever published in the United States. After an exhausting year of negotiations, it was time to lock ourselves in a room and figure out if we had a deal.

About the Author
Adam Ziegler is a lawyer and software builder. He led the Caselaw Access Project and other work at Harvard’s Library Innovation Lab from 2014 to 2021. He works currently at TrueLaw, which helps law firms use AI to improve their operations and services.

Fast forward nine years, and that “someday in the future” finally is here. On March 1, 2024, our collective efforts on this project — the Caselaw Access Project — culminated in the full, unrestricted release of nearly 7 million U.S. state and federal court decisions representing the bulk of our nation’s common law. I had the privilege to lead this work at Harvard for almost eight years. Wrecked Toyota aside, it was a career-defining experience, and I’m immensely grateful to everyone at Harvard and Ravel who worked hard to make it possible.

To mark the occasion, I wanted to share some of the project’s inside story, reflect on its impact and look ahead to what I hope this data will make possible in the future.

Why Even Do This Project?

Court decisions are public information — they’re authored by judges and issued publicly to tell us what the law is, and why. We all should have free, easy access to the law, and no one should gain competitive advantage from having privileged access to the law itself.

But historically we’ve not treated the law this way. Instead, we’ve acted like our law is created and owned by the companies that publish it. Our courts, with few exceptions, have allowed publishers to control access to the law and to dictate how we read, study, cite and use the law. Naturally, publishers have prioritized their commercial interests. They’ve made the law scarce and expensive. The effect has been to stifle innovation and competition in the field of legal information and, I would argue, to impede justice and the rule of law.

Harvard scanned 38.6 million pages from 39,796 books and converted it all into machine-readable text files.

This is why the Caselaw Access Project needed to happen and why it was worth doing, even with all the obstacles, frustrations and compromises along the way.

Let’s Make a Deal

I interviewed to join the Harvard Law Library and manage the project in late 2013, about a year after Nik, Daniel and Prof. Zittrain (or “JZ” as he’s affectionately known) had hatched the idea for the project and started working out a skeletal framework for a potential deal.

I’ll confess: when I first learned the project would not be paid for directly by Harvard, but instead would be funded by a venture-backed Silicon Valley startup that would get a few years of special access in return, I almost bailed. I thought it was absurd. Why would Harvard rely on a fledgling startup for this, especially at the cost of limiting access?

By the time we’d arranged ourselves around a conference table in early 2015, I had a different perspective. I’d spent the last year negotiating the deal with Daniel and Nik but also with Harvard’s many internal stakeholders. I’d come to understand that while Harvard’s librarians and resources made the project uniquely possible, Harvard’s bureaucracy and wealth also made the project virtually impossible. It was only through a capable partner like Ravel that the project had a real chance.

I’d also seen that Daniel, Nik and the Ravel team weren’t in it for purely commercial reasons. Although our team knew we had to give Ravel a few privileged years to exploit the project’s data, we drove a hard bargain to ensure the project would serve the interests of scholars and researchers and the broader public. Most importantly, we had to be certain that if (or when) one of the big publishers bought Ravel, their acquisition would not undermine the project’s goals. We had to make sure a buyer would be locked into continuing to support the project and would have no power to stop it. We were dealing with Ravel, but we were also negotiating against Ravel’s future buyer.

This led us to push for a battery of onerous protections and commitments. Ravel’s acceptance of these terms made clear to me that even within the context of their commercial goals, they shared the public interest motivations of the project. Most legal tech startups make bold declarations about public interest, access to justice and democratizing the law when it suits them. Very few make company-defining commitments that put those priorities front and center.

Case law books waiting to be scanned.

Ultimately, by mid-2015, the deal had taken shape. Harvard would contribute the law books and run the scanning process inside the law library. Ravel would pay for the scanning and subsequent data processing, including redaction of any extraneous material that didn’t originate from the courts. Both Harvard and Ravel would get access to the processed data. Harvard would have the right to share the data on a restricted basis immediately. Ravel would be obligated to provide public access from day one and would put its source code in escrow to secure this obligation. In exchange, Ravel would get an exclusive right to exploit the data commercially for roughly six years after we finished digitization – until March 2024. If Ravel or its successor ever stopped providing public access to the data, they would lose their commercial advantage and all the data would go free.

The contract still took a couple more months to finalize. There were other terms that were important to the investors and university administrators we needed to approve the deal. There were a few dicey moments where it looked like everything might fall apart over trivial concerns. But finally, we closed the deal and the signature pages hit my inbox. A short while later, we publicly announced the project and the key terms of the deal.

Then came the real work.

Making Mass Digitization Work Inside the Law Library

Inside the library, we’d been eagerly gearing up for the digitization effort. In parallel with the negotiations, we’d run a proof of concept that allowed us to figure out the process, equipment, systems and staffing we’d need to meet our quality standards. We’d carefully modeled out the costs and timing. We knew exactly how many pages per day we could scan, how much it would cost and what dials we could turn to alter cost or throughput if needed.

When the deal closed, we were ready to go. We’d already tackled many of the toughest challenges:

We didn’t know precisely which books to scan. There was no definitive list of “all the books containing official court decisions.” So we did research and made one.
We didn’t have all the books we needed. Like many law libraries, we had stopped buying some of the books that contained official court decisions. They were too expensive, and almost no one ever used them. So we went out and bought books to fill the few gaps.
Most of the books weren’t physically in the library building. They were 30 minutes away in the Harvard Depository, where they were mixed in with about 10 million other books. So we figured out how to get the 40,000 books we needed and move them over to the library efficiently.
We had almost no book-level metadata, but we needed to record key information about every book, such as when it was published and what jurisdiction(s) it covered. We also needed to make sure there were no missing or damaged pages. So we created a process to visually inspect every book and to manually record the necessary metadata.
To scan the books at high speed, we needed first to free the pages from their binding. So we bought a machine we called the “Guillotine,” which sliced through the spine of the books with a crashing thud. (Yes, there were physical safety considerations). The Guillotine was so heavy we had to put it on a reinforced floor. It was so loud we had to suspend work around exam time.
The high-speed scanner was an amazing machine, but it wasn’t perfect, and so we had to do quality control on the scans to make sure they met our standards. Over the course of the scanning effort, we visually inspected roughly 20 million scanned pages.
After scanning, we had to preserve the books, just in case we needed to scan them again or someone needed to reboot democracy. So we used a vacuum-sealing machine designed for meat-packing to individually seal every book into a moisture-resistant bag before shipping them all to an underground limestone mine in Kentucky.
We had to find space for all this in the library, where students studied, faculty worked and librarians served. We had to put the metadata stations, the Guillotine, the scanner, and the vacuum-sealer in separate areas, on various floors, which meant our team had to physically transport small carts full of books between stations on an elevator.
And finally we had to keep track of all 40,000 books every step of the way, so we could account for each one, continuously monitor our progress and verify that we had processed every book we needed to. So we built custom software and adapted a hand-scanner system so we could check in every book at each station.
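To make the tracking idea concrete, here is a minimal sketch of a station check-in log. It is not the Lab's actual software: the station names, their order and the `BookTracker` class are all assumptions made up for illustration, loosely following the workflow described in the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical station names, in an assumed processing order.
STATIONS = ["metadata", "guillotine", "scanner", "quality_control", "vacuum_seal"]


@dataclass
class BookTracker:
    # Maps a book's barcode to the stations it has checked in at so far.
    checkins: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, barcode: str) -> None:
        """Add a book to the tracker before it enters the workflow."""
        self.checkins.setdefault(barcode, [])

    def check_in(self, barcode: str, station: str) -> None:
        """Record a hand-scanner check-in, enforcing the station order."""
        if barcode not in self.checkins:
            raise KeyError(f"unknown book: {barcode}")
        done = self.checkins[barcode]
        if len(done) == len(STATIONS):
            raise ValueError(f"{barcode} has already finished every station")
        expected = STATIONS[len(done)]
        if station != expected:
            raise ValueError(f"{barcode} expected at {expected!r}, got {station!r}")
        done.append(station)

    def progress(self) -> Dict[str, int]:
        """Count how many books have passed through each station."""
        counts = {name: 0 for name in STATIONS}
        for done in self.checkins.values():
            for name in done:
                counts[name] += 1
        return counts

    def unfinished(self) -> List[str]:
        """List the books that have not yet cleared the final station."""
        return sorted(b for b, done in self.checkins.items() if len(done) < len(STATIONS))


tracker = BookTracker()
tracker.register("BOOK-0001")
tracker.register("BOOK-0002")
for name in STATIONS:
    tracker.check_in("BOOK-0001", name)
tracker.check_in("BOOK-0002", "metadata")
print(tracker.progress())
print(tracker.unfinished())
```

Rejecting an out-of-order check-in is one plausible way to catch a cart that skipped a floor; a real system would also need persistence and a way to handle rescans.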

Overcoming these practical challenges was the hardest work we did, and the success of this phase was due entirely to the professionalism, dedication and adaptability of the library team in the face of quite a bit of pressure and skepticism, including from within Harvard. There were no high-paid consultants, distinguished thought-leaders or pompous muckety-mucks telling us how to do this. Mostly it was just a bunch of library professionals, a programmer and a ،n overbearing lawyer rolling up our sleeves in the basement and striving together to figure it out because we cared. Real innovation.

How Imperfect Law Becomes Imperfect Data

Scanning was the hardest thing, but it wasn’t the only thing. We also had to transform 40 million scanned page images into structured data representing all of the individual cases, which could be displayed for people on the web, downloaded in bulk and served machine-to-machine through APIs.

We had a lot of help here, both from Ravel and from the vendor we relied on to handle the processing. Two related things stand out from this phase: redaction and imperfection.

The Unfortunate Need to Redact

In the project’s early years, the remote possibility that a legal publisher might try to stop our work loomed large. It consumed a lot of time, energy and resources, and it forced us to make compromises.

The ‘guillotine’ the project used for slicing books’ bindings.

The problem was this:

Many of the books that contain our official case law were published by companies that had a history of acting aggressively through litigation to prevent others from copying the law or from competing in the realm of legal information.
While no one would claim in good faith that court decisions authored by judges can be copyrighted by publishers, many publishers had adopted a practice of injecting into the text of judge-authored decisions a variety of editorial devices (such as headnotes and other annotations). In these, publishers did claim copyright.
This intermingling of editorial content with official statements of law has a contaminating effect. You cannot get your hands on the official common law without also touching editorial content, which is harmless to read but somewhat radioactive to copy and share.

To achieve our goals on the project, we had to contend with this gnarly problem. The only solution available to us was redaction.

Redaction means the removal or obfuscation of unwanted information. “Unwanted” is exactly how we felt about the headnotes and other editorial materials embedded within the pages of the books we had scanned. We would have gladly worked with a “clean” version of the official law, but it did not exist. The only official version of the law was the contaminated one. And so we had to prioritize, above almost everything else, the accurate identification and removal of these unwanted materials from every page and every court decision that came out of every book that was not yet in the public domain. This was not easy.

The short version of this story is: we had to figure out what editorial content to expect in the scanned pages; we had to be ready to alert on any unexpected content; we had to identify where this content lived within a case and on a page; we had to excise this material from the textual data; and we had to paint solid black boxes over the content on the scanned images. We had to do all of this with extreme precision to ensure that everyone could see the law and no one could see the editorial litter.
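For readers who think in code, the text half of that pipeline can be sketched roughly as follows. This is an illustration only, not the vendor's or the Lab's actual process: the label names, the `(start, end, label)` span format and the `redact_text` function are assumptions, and the sketch leaves out the image side (the black boxes) entirely.

```python
from typing import List, Tuple

# A span marks editorial content as (start, end, label), with end exclusive.
Span = Tuple[int, int, str]

# Hypothetical set of editorial content types we expect to find.
EXPECTED_LABELS = {"headnote", "key_number", "annotation"}


def redact_text(text: str, spans: List[Span]) -> str:
    """Return the text with every editorial span cut out."""
    # Alert on anything we did not plan for rather than guessing.
    unexpected = sorted({label for _, _, label in spans if label not in EXPECTED_LABELS})
    if unexpected:
        raise ValueError(f"unexpected editorial content: {unexpected}")

    kept = []
    cursor = 0
    for start, end, _ in sorted(spans):
        if end <= cursor:
            continue  # already removed by an earlier, overlapping span
        start = max(start, cursor)
        kept.append(text[cursor:start])  # keep the court's text before the span
        cursor = end
    kept.append(text[cursor:])  # keep whatever follows the last span
    return "".join(kept)


headnote = "HEADNOTE: summary written by the publisher. "
opinion = "The judgment of the district court is affirmed."
print(redact_text(headnote + opinion, [(0, len(headnote), "headnote")]))
```

Failing loudly on an unknown label mirrors the "alert on any unexpected content" step: in a setting like this, a missed headnote is a worse outcome than a stalled batch.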

Now let me tell you what I really think. Headnotes, key numbers, annotations and the like can be useful. Viewed in their own right, they’re not garbage at all. They’re the product of major investment and serious effort by trained professionals. There was a time when they were needed to assist the discovery and understanding of the law. They do deserve protection, but only as an independent enhancement layer that’s distinct from the law itself. When they’re combined with the official law in a way that interferes with propagation and access, they’re best viewed as pollution. It’s a great failure of our judges, courts and legislatures that they’ve allowed — and continue to allow to this day — commercial entities to mingle their owned commentary with our official law.

If you’re interested in learning more on this topic, I recommend reading the Supreme Court’s 2020 decision in Georgia v. Public.Resource.Org and the many briefs submitted supporting access to law, including the amicus brief that we filed. If you’re a redaction nut, please enjoy an example of our work on Vol. 323 of the Federal Reporter 2d.

Getting Comfortable with Imperfection

Because we invested so much in redaction, we had to make sacrifices elsewhere. The two biggest sacrifices were in the transcription of opinion text and in the scope of the project. We used a technology called optical character recognition (“OCR”) to extract all the case text from the scanned images. OCR output is not perfect. It typically requires some degree of machine and/or human correction. While we corrected some of the OCR output – text that identified parties and courts, for example – we did not correct the OCR output of the actual opinion text. In fact, the raw OCR quality is extremely good, and more than sufficient for most purposes. But it’s not perfect, and our law deserves perfection.

We also couldn’t keep digitizing the law forever. We had to limit the scope of the project, and we needed to turn our attention to the work of making the data accessible online. And so we had to end scanning in early 2017, although eventually we were able to extend it into 2018.

I’ve heard people question these compromises, as if they made the project pointless. That’s bunk. We calculated that if we made sure to create and share high quality scanned images and metadata for the full historical record — the work that would be hardest to reproduce — technology would continue to improve and others (ideally the courts) would step up to contribute going forward. Indeed, this is what’s happening. OCR technology is much improved, and it’s not too hard to redo the OCR to get better results. With all the images and metadata now freely, publicly available for anyone to access, we can all go to work making the text fidelity even better.

As for the project’s scope, Ed Walters and the good folks at Fastcase (now vLex) generously agreed to share their transcriptions of some newer court decisions. At the same time, the non-profit Free Law Project, led by Mike Lissner, continues to set the standard and do a far better job than the government itself in providing widespread public access to newly issued court decisions and case dockets. The courts haven’t done their part yet, but I’m still hopeful.

So the data isn’t perfect. It’s a little bit stale. But these gaps are closing, and someday they’ll be gone.

Access, Exploration and Experimentation

Everything I’ve shared to this point was a precursor to the ultimate end goal: free public access online. Ironically, when we started, we had no idea what public access would look like or if our team in the library would deliver it. This is why we made sure the contract required Ravel to deliver public access.

An Awkward Dance

Then in June 2017, LexisNexis announced that it had bought Ravel. Their public statements expressed an intention to continue supporting the project and to follow through on Ravel’s commitments. Privately they said the same thing. They had little choice; they inherited the contract, and it was airtight. Either follow through and gain the benefit of the remaining commercial exclusivity and a friendly relationship with Harvard, or renege and see all the data — which by this time was nearly complete — go free immediately.

But words are easy. In practical reality, we were caught in an awkward dance in which Lexis did the minimum required under the inherited agreement, and only if we held them to it. Their follow-through on public access was perfunctory at best. I would’ve been happy to see Lexis lean into the opportunity and become a bold standard-bearer for true public access to law. I also would’ve been happy to see Lexis wholly abandon the commitments Ravel had made. But now that the hard digitization work was basically done, I had little interest in frantically waving around the contract and chasing Lexis to do something it had no intrinsic motivation to do. I also knew it would be difficult and frustrating to get Harvard to throw any real institutional weight behind persuading Lexis to do much more.

So instead of focusing our energy on pushing Lexis, we started working earnestly within the Library Innovation Lab to take advantage of the rights the contract gave us to offer public access directly ourselves.

Delivering on Public Access

This was my favorite part of the project. This is what our Library Innovation Lab loved to do and did best: design and code high-performing, open source software that would fulfill the fundamental library mission of enabling access to knowledge.

We had almost free rein to build anything we wanted that would make it easier for people to read and study the law. The biggest question we faced was whether to try to build a free legal research tool that might substitute for expensive commercial products. We decided not to. Instead, we focused on providing direct access to the data. We wanted to enable others to build tools and products, and we wanted to explore new ways of interacting with the data. We did build a simple search and viewer interface for those who just wanted to read a few cases, but we chose mainly to prioritize things that commercial vendors would never do.

The scanner belts.

It’s hard in a post like this to describe the technology we built, so instead I’ll invite you to use the Caselaw Access Project and, if you’re so inclined, to copy and remix the project’s code. When you visit CAP today, you’ll see that the legacy site and tools are still available at https://old.case.law, but they’re set to sunset in September 2024 now that there are no restrictions on the data and everyone can do what only the Lab could do before. Check out Trends, an amazing interface built by the Lab’s current director, Jack Cushman, to allow people to explore how legal language and ideas evolved. Another favorite of mine is Colors, built by Anastasia Aizman in 2019 as an early, whimsical exploration of the data using natural language processing and neural networks.

The Virtues of Good Plumbing

These explorations mattered, but our biggest technical achievements were not the vivid demo applications we built ourselves. The real contribution was constructing the robust “plumbing” through which we could deliver the data to others.

The plumbing we built had two main parts: an API and a bulk data service. The technical details are amazing, and if technical details are your thing, stop reading and go look at the code. Reach out to Jack Cushman and the Lab’s current team to learn more about what we did. Find ways to contribute to the amazing work the Lab is doing now in the areas of legal AI and web archiving.

Broadly speaking, we designed the API for people writing computer programs that would need on-demand access to information about particular U.S. court decisions, or who wanted to maximize what they could do with their daily allotment of full-text cases. We designed the bulk data service for verified non-commercial researchers who wanted to work with large volumes of court decisions to gain some new insight or to investigate big ideas across the dataset.

One key emphasis was to slice and dice and repackage the data in as many ways as we could, to support the widest possible range of users and uses. As a result, now you can get PDFs of the scanned images, either as individual cases or whole volumes. You can get cases as JSON or XML, with the text of opinions as plaintext, HTML or XML. You can get whole cases, or just the metadata. You can get smaller datasets reflecting any of the time periods, jurisdictions, courts and titles in the collection. You can get specialized datasets that reflect all the citation-based connections (the “citation graph”) among cases. You also can create your own specialized datasets based on any search term and a variety of complex filters. If you want to quickly curate and download a dataset of all decisions issued between 1960 and 1990 by courts in Iowa, which mention “farm” and cite to the Indiana Supreme Court, go for it. If you want to put the entire collection of published U.S. court decisions on a thumb-drive, have at it.
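As a rough illustration of that Iowa example, here is a sketch of the same filter applied to case records already loaded into memory. The record shape is an assumption chosen for readability (flat `decision_date`, `jurisdiction`, `text` and `cited_courts` fields); it is not the project's actual JSON schema, and `select_cases` is a hypothetical helper rather than part of the project's API.

```python
from typing import Any, Dict, Iterable, Iterator

Case = Dict[str, Any]


def select_cases(
    cases: Iterable[Case],
    *,
    jurisdiction: str,
    first_year: int,
    last_year: int,
    term: str,
    cited_court: str,
) -> Iterator[Case]:
    """Yield cases matching every filter (assumed, simplified record shape)."""
    needle = term.lower()
    for case in cases:
        year = int(case["decision_date"][:4])  # dates assumed to start with YYYY
        if (
            case["jurisdiction"] == jurisdiction
            and first_year <= year <= last_year
            and needle in case["text"].lower()
            and cited_court in case["cited_courts"]
        ):
            yield case


# Made-up sample records, for illustration only.
sample = [
    {"name": "A v. B", "decision_date": "1975-03-02", "jurisdiction": "Iowa",
     "text": "The farm lease was valid.", "cited_courts": ["Indiana Supreme Court"]},
    {"name": "C v. D", "decision_date": "1995-06-11", "jurisdiction": "Iowa",
     "text": "The farm was sold.", "cited_courts": ["Indiana Supreme Court"]},
    {"name": "E v. F", "decision_date": "1980-01-20", "jurisdiction": "Iowa",
     "text": "A contract dispute.", "cited_courts": []},
]

matches = [
    case["name"]
    for case in select_cases(
        sample,
        jurisdiction="Iowa",
        first_year=1960,
        last_year=1990,
        term="farm",
        cited_court="Indiana Supreme Court",
    )
]
print(matches)
```

Note that the substring test also matches words like "farmer"; a stricter search would tokenize the text first.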

Impact

While immersed in solving the practical challenges of the contract, the scanning, the processing and the delivery, we didn’t think much about the impact the project might have once we made the data available. We took it on faith that someone would find it useful.

We launched both the API and the bulk data service publicly in late 2018 and got a wave of favorable publicity. The one bit of recognition that stands out for me was an editorial in The Harvard Crimson titled “In Favor of the Caselaw Access Project.” For some reason, there’s something special about a student publication expressing gratitude for our work.

Publicity is not the same as impact. What really mattered was whether people used the data. For a while it was hard to know what people were doing, but now we can start to see the evidence. If you look at references to the project on Google Scholar or SSRN, you’ll see hundreds of articles across a dizzying array of topics like antitrust law, linguistics, judicial partisanship, tax law, ، transplant litigation, machine learning and LLMs, legal pedagogy, and the long-term common law influence of cases involving enslaved people, just to name a few. If you search on the web, you’ll see over 50 library guides that highlight the project as a source for legal research or scholarly data and hundreds of thousands of links into the project’s website. If you look at Reddit, you’ll see an endless scroll of posts mentioning the project in all sorts of useful (and some wild) contexts. If you look at Github or HuggingFace, you’ll see a growing number of technical projects using and crediting the project. If you talk to lots of legal tech startups, like I do, you’ll hear how much easier it is to start something new because of the project.

This is only what’s public and easy to find in a few minutes online, or what people I happen to talk to are willing to share. This is just in the relatively short time since we launched the data, and all of it came during a period in which we had to artificially limit and condition access. Now the floodgates can open.

What Comes Next?

Now I shift from personal recollection and observation to speculation. What comes next as a result of the Caselaw Access Project? I don’t really know.

I believe the project will continue to enable scholarly research that helps us better see the harmful patterns, prejudices and past failures of our legal system, so that we can work together toward something much better. In law school, I learned a lot about civil procedure and commercial transactions but absolutely nothing about how our courts handled slavery before the Civil War. For a long time, I was ignorant about the active efforts of our profession in perpetuating this sin. But through the Caselaw Access Project’s data and tools, I learned that protecting slavery was one of our courts’ most prominent early priorities. Through important work by others using the data, I’ve learned that this shameful legacy continues to influence our law today.

I also have a strong hunch that generative AI will transform the legal industry, and that the project’s data will play a meaningful role. My hope is that the project will make it easy for smart, creative people to explore new AI-enabled ideas that would have been impossible if the law remained locked away in books and proprietary databases. I’m happy that it will be within reach for anyone with technical skills to build their own version of an AI legal assistant, rather than it being reserved only to companies with special access to the law. I also suspect the project’s data will be part of the solution to citation hallucination, and I hope courts will soon realize that the root causes of this problem are bad lawyering and inaccessible law, not technology.

There are positive versions of the future in which the project contributes to tools and services that help lower the access to justice barrier, improve the quality and value of legal work and allow people to better understand their rights and obligations. These are the future scenarios I’m committed to and will continue working toward enthusiastically.

But there are also versions of the future in which technical experts, with no awareness of or regard for the nature of the law, might use the project’s data to inadvertently do dangerous and harmful things. Here I’ll share a word of caution about the common law and the data we’ve helped make available: it’s always complicated, often ، and frequently just plain evil. These cases are full of horrific details of violence and suffering. The people mentioned in the cases are real. Many of them, or their families and friends, are still living. And finally, over the course of the 350-plus years represented in the dataset, the law has often been horribly, disgustingly wrong. Don’t make the mistake of believing the law of Alabama or Massachusetts from 175 years ago is fit to inform a modern-day free legal advice chatbot. Don’t assume judges are always impartial or never prejudiced. Don’t presume all law is good law.

These are not reasons to keep the law closed or to continue giving privileged access to a few large companies, but they are compelling reasons for all of us to be thoughtful about how we use and share the data. Perhaps they are reasons, going forward, for judges to think differently about how they write opinions and what details need to be made explicit for a decision to carry its weight.

***

All in all, I’m incredibly fortunate that I could contribute to this project and work closely with so many amazing people to see it through from idea to impact. I’ve been lucky to work on a lot of great projects, but this one stands alone in every way. So worth it.

Source: https://www.lawnext.com/2024/03/guest-post-the-caselaw-access-project-then-now-tomorrow.html