Objectives
Although ChatGPT was not developed for medical use, there is growing interest in applying it in medical fields. Understanding its capabilities, and the precautions required for its medical use, is an urgent matter. We hypothesized that the amount of information published in each medical field would be proportional to the amount of training ChatGPT receives in that field, and hence to the accuracy of its answers.
Study design
A non-clinical experimental study.
Methods
We administered the Japanese National Medical Examination to GPT-3.5 and GPT-4 to measure the accuracy and consistency rates of their responses. We counted the total number of documents in the Web of Science Core Collection per medical field and assessed the relationship between this count and ChatGPT's accuracy. We also fitted multivariate-adjusted models to investigate risk factors for incorrect answers.
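As a minimal sketch of this analysis, the snippet below computes an accuracy rate, a consistency rate, and a Pearson correlation between per-field document counts and per-field accuracy. The data structures, the definition of consistency (identical answers across repeated runs of the same question), and all values are illustrative assumptions, not the study's actual protocol or data.

```python
# Hedged sketch: accuracy/consistency rates and field-level correlation.
# All names and numbers below are hypothetical placeholders.
from scipy.stats import pearsonr

def accuracy_rate(responses, answer_key):
    """Fraction of questions whose first response matches the answer key.

    responses: dict mapping question id -> list of answers across runs.
    answer_key: dict mapping question id -> correct answer.
    """
    correct = sum(runs[0] == answer_key[q] for q, runs in responses.items())
    return correct / len(answer_key)

def consistency_rate(responses):
    """Fraction of questions answered identically across all repeated runs."""
    stable = sum(len(set(runs)) == 1 for runs in responses.values())
    return stable / len(responses)

# Placeholder per-field values: accuracy vs. Web of Science document counts.
field_accuracy = [0.85, 0.72, 0.90, 0.65]
field_documents = [120_000, 40_000, 200_000, 15_000]

r, p = pearsonr(field_documents, field_accuracy)
print(f"R = {r:.2f}, P = {p:.3f}")
```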
Results
GPT-4 achieved an accuracy rate of 81.0% and a consistency rate of 88.8% on the examination, both improvements over GPT-3.5. A positive correlation was observed between the accuracy rate and the consistency rate (R = 0.51, P < 0.001). The number of documents per medical field was significantly correlated with the accuracy rate in that field (R = 0.44, P < 0.05), and a relatively small number of publications was an independent risk factor for incorrect answers.
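The risk-factor analysis could, for example, take the form of a multivariate logistic regression with "incorrect answer" as the outcome. The sketch below uses simulated data and assumed covariate names (`low_publication_field`, `question_length_z`); it illustrates the general technique, not the authors' actual model or dataset.

```python
# Hedged sketch of a multivariate-adjusted model for incorrect answers,
# fitted on simulated data. Covariates and effect sizes are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
low_pub = rng.integers(0, 2, n)          # 1 = field with few publications
q_length = rng.normal(0.0, 1.0, n)       # standardized question length

# Simulate a higher error probability in low-publication fields.
p_err = 1 / (1 + np.exp(-(-1.0 + 1.2 * low_pub + 0.2 * q_length)))
incorrect = rng.binomial(1, p_err)       # 1 = model answered incorrectly

df = pd.DataFrame({
    "incorrect": incorrect,
    "low_publication_field": low_pub,
    "question_length_z": q_length,
})
fit = smf.logit(
    "incorrect ~ low_publication_field + question_length_z", data=df
).fit(disp=False)
print(np.exp(fit.params))  # odds ratios; OR > 1 means higher risk of error
```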
Conclusions
Checking consistency may help identify incorrect answers when using ChatGPT. Users should be aware that the accuracy of ChatGPT's answers may decrease when it is asked about topics with limited published information, such as new drugs and diseases.