Can GPT-4 do as well on the bar exam as OpenAI claims?

A new research paper calls the company’s findings into question.


April 5, 2024



In what sounds like a budding lawyer’s stress dream, a new research paper picks apart OpenAI’s claim that GPT-4 managed to pass the bar exam with flying colors.

The paper, called “Re-evaluating GPT-4’s bar exam performance” and published this week in Artificial Intelligence and Law, questions both the large language model’s (LLM) score and its interpretation, as published in a pair of reports last year. It comes as the legal field has forged ahead with turning LLMs into research tools, but AI’s use in the legal profession has also led to some high-profile mistakes.

What the study says: OpenAI’s original attention-grabbing claim about GPT-4’s legal prowess traces to a technical report published by the company last year, which boasted that the LLM placed “around the top 10% of test-takers” on a “simulated bar exam.” A separate paper published before OpenAI’s report found that GPT-4 scored a 297 on the Uniform Bar Exam, high enough to pass in all jurisdictions.

But the Artificial Intelligence and Law paper, authored by Eric Martínez, a PhD student at MIT and a graduate of Harvard Law School, argues that the first of those claims rests on a comparison with a February batch of Illinois State Bar test-takers that “appears heavily skewed toward repeat test-takers who failed the July administration and score significantly lower than the general test-taking population.”

Recasting GPT-4’s score against test-takers from a recent July exam put the AI in the 69th percentile for the test overall and around the 48th percentile for the essay section. The AI system also scored slightly lower than that when compared to first-time test-takers.

Martínez does replicate the 297 multiple-choice score with his own simulation, but calls into question the study’s methodology on essay grading. Rather than asking trained graders to evaluate the essays based on the bar organizers’ guidelines, the authors of that study compared GPT-4’s essays to “good answers” from a Maryland bar exam, Martínez wrote.

OpenAI declined to comment on the findings.

What it means: Martínez wrote that the results “raise concerns both for the transparency of capabilities research and the safety of AI development,” but could offer lawyers “at least a mild temporary sense of relief regarding the security of the profession” when it comes to the essay-writing portion.

Rob Scott, managing partner of intellectual property and IT law firm Scott & Scott and co-founder of legal AI startup Monjur, told Tech Brew that he’s found LLMs to be “very good” at summarizing and outlining, but less adept at legal analysis, where the technology “frequently hallucinates the legal conclusions.”

“In the legal context, you need to know when it’s right or wrong. You need to double-check everything that it does,” Scott said. “It’s just a tool. It’s not a replacement for a lawyer. It’s a tool to be used by experienced lawyers when it comes to legal services.”
