With hallucinations continuing to haunt applications of generative AI in the legal field, Paxton AI, a startup for contract review, document drafting and legal research, today released results of a benchmarking study that showed its product achieved 93.82% average accuracy on legal research tasks.
Paxton also today released a new Confidence Indicator feature that will help its users evaluate the reliability of AI-generated responses.
To test the accuracy of its product, Paxton used a set of legal hallucination benchmarks that researchers at Stanford University developed to measure the performance of public-facing large language models such as OpenAI’s ChatGPT on legal research questions.
Those researchers, all of whom also participated in the much-discussed study earlier this year of hallucinations in commercial legal research products, published the results of their study of public-facing LLMs in the paper, Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. They also published their data for others to use for benchmarking.
Paxton used these Stanford legal hallucination benchmarks to evaluate the accuracy of its own legal AI tool, and specifically its ability to produce correct legal interpretations without errors or hallucinations.
While the Stanford data encompassed some 750,000 tasks spanning a wide variety of legal questions and scenarios and of varying levels of complexity, Paxton selected 1,600 tasks that it determined to be a representative sample of the full set of tasks.
It also excluded certain types of tasks because of “specific alignment considerations with our current testing framework,” and, in another case, due to a discrepancy between the published dataset and the metadata Paxton uses to answer questions.
Based on these benchmarks, here are the results Paxton achieved, as reported today on Paxton’s blog:
Overall, the results show that Paxton achieved an average non-hallucination rate of 94.7% and accuracy of 93.82%. In the interest of transparency, Paxton says, it is releasing the detailed results of its tests on its GitHub repository so that others can independently review its methodologies, results and performance.
Confidence Indicator
Along with today’s release of these benchmarking results, Paxton is also introducing its Confidence Indicator, a new feature that rates each answer it gives with a confidence level of low, medium or high.
While some LLMs already generate their own confidence scores, Paxton says those are not always indicative of the actual reliability or accuracy of an answer.
Its Confidence Indicator is different, it says, because it evaluates the response based on a comprehensive set of criteria, including the contextual relevance, the evidence provided, and the complexity of the query.
With this new feature, Paxton users will be able to assess the reliability of any answer Paxton delivers to their query.
In an example provided by Paxton, a user queries:
“I need to understand family law. I am working on a serious matter. It is very important to my client. The matter is in PA and NYC. custody issue.”
Paxton responded to the query, but because the query was vague and unfocused, Paxton indicated that it had low confidence in its response (image above). It offered suggestions for how the user could improve the query and receive a better response.
Based on that, the user revised the query to provide more detail:
“I am working on a family law matter that may implicate both NY and PA law. Mother and father are getting divorced. They live in NY in the summer and PA during the school year. The parents are having a custody dispute. I am representing the father.”
With this greater level of detail, and because the user also selected New York and Pennsylvania courts as sources, Paxton was able to increase the level of confidence in its response to medium. But because the user failed to ask a focused question, Paxton offered additional tips for getting better answers. (Image above.)
This time, the user was much more detailed:
“I am working on a family law matter that may implicate both NY and PA law. Mother and father are getting divorced, they have two children aged 12 and 15. What do courts in New York consider when determining custody? What do courts in Pennsylvania consider when determining custody? Please separate the analysis into two parts, NY and PA.”
With that level of detail, Paxton was able to provide an answer with a high level of confidence in its accuracy.
“The Paxton AI Confidence Indicator improves the user experience by quickly showing the confidence and reliability of Paxton’s AI generated responses,” the company says. “The Confidence Indicator will help speed up decision making by providing a transparent assessment of the quality of the response.”
Free Trial
Paxton is currently offering a seven-day free trial of its product, including the Confidence Indicator. After that, subscriptions start at $79 per month per user.
Source: https://www.lawnext.com/2024/07/paxton-ai-releases-benchmarking-data-showing-94-accuracy-of-its-legal-research-tool-also-releases-new-confidence-indicator-feature.html