Google's medical LLM shows improved accuracy

In a study published in Nature, Google revealed that its generative AI technology answered medically related questions with 92.6% accuracy.
By Jessica Hagen

A study performed by Google researchers and published in Nature reveals the tech giant's generative AI technology Med-PaLM provided long-form answers aligned with scientific consensus on 92.6% of questions submitted, which is in line with clinician-generated answers at 92.9%.

Med-PaLM is a generative AI technology that utilizes Google's LLMs to answer medical questions.

Researchers used MultiMedQA, a benchmark combining six existing medical question datasets spanning research, professional medicine and consumer queries, together with HealthSearchQA, a dataset of commonly searched medical questions.

MultiMedQA questions were put through PaLM, a 540-billion parameter LLM, and Flan-PaLM, its instruction-tuned variant. 

Answers were then put through human evaluations to assess comprehension, reasoning, factuality, and possible harm and bias. 

Using various prompting strategies, Flan-PaLM performed strongly on the MultiMedQA datasets, reaching 67.6% accuracy on U.S. Medical Licensing Exam-type questions and surpassing the previous best result by 17%. Still, researchers noted key gaps in its answers to consumer medical questions.

To address those gaps, researchers introduced instruction prompt tuning, a data- and parameter-efficient alignment technique. The resulting model, Med-PaLM, produced answers aligned with scientific consensus far more often (92.6%) than Flan-PaLM (61.9%).

Flan-PaLM answers were also rated as potentially leading to harmful outcomes 29.7% of the time, compared with 5.9% of the time for Med-PaLM. Clinician-generated answers were rated similarly to Med-PaLM on this measure, at 5.7%.

Researchers acknowledged that many limitations still need to be overcome before the models are viable for clinical use, and further evaluation is necessary, particularly regarding safety, bias and equity.  

"Our hope is LLM systems such as Med-PaLM, that are designed for medical applications with safety as paramount, will democratize access to high-quality medical information, particularly in geographies with a limited number of medical professionals," Vivek Natarajan, AI researcher at Google and one of the researchers in the study, said on LinkedIn.

"And eventually, with further development, rigorous validation of safety and efficacy, we hope Med-PaLM will find broad uptake in direct care pathways – augmenting our clinicians, reducing their administrative burden, aid with clinical decision making, giving them more time to focus on patients and overall make healthcare more accessible, equitable, safer and humane."


In March, the technology company's Med-PaLM 2 was tested on U.S. Medical Licensing Examination-style questions, performing at an "expert" test-taker level with 85%+ accuracy. It also received a passing score on the MedMCQA dataset, a multiple-choice dataset designed to address real-world medical entrance exam questions.

One month later, the company announced Med-PaLM 2 would be available to select Google Cloud customers in the coming weeks to share feedback, explore use cases and conduct limited testing. 

The company also announced a new AI-enabled Claims Acceleration Suite, created to help with the process of prior authorization and claims processing for health insurance. The Suite converts unstructured data (datasets not organized in a predefined manner) into structured data (datasets highly organized and easily decipherable).