Evaluation of Agreement in Peer Reviews of Case Reports Between Physical Therapists and ChatGPT

Sadaya MISAKI
Purpose:

This study aimed to evaluate the agreement between physical therapists and ChatGPT in peer reviews of case reports, to understand the potential and limitations of integrating AI into this educational process.

Methods:

The subjects were abstracts submitted to our organization's case report session in December 2023. Each abstract was peer-reviewed by two physical therapists with more than four years of clinical experience. ChatGPT (version 4) was also used, with a prompt instructing it to act as a Japanese physical therapist reviewing the abstract. The reviews covered nine criteria: originality, contribution, title and keywords, background and objectives, intervention methods, evaluation methods, analysis methods, progress and results, and discussion. Each criterion was scored on a five-point scale, and total scores were calculated. Concordance rates and weighted kappa coefficients between the physical therapists' and ChatGPT's scores were determined, followed by a Bland-Altman analysis of the total scores.
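
As a rough illustration (not the authors' code), the per-criterion agreement statistics described above could be computed as follows. The kappa weighting scheme (linear vs. quadratic) is not stated in the abstract, so linear weights and hypothetical scores are assumed here.

```python
# Minimal sketch of the agreement statistics for one review criterion,
# using hypothetical five-point scores; linear kappa weights are an
# assumption, as the abstract does not state the weighting scheme.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores (1-5) for one criterion, one entry per abstract:
pt_scores = np.array([3, 4, 2, 5, 3, 4, 3, 2])   # physical therapist
gpt_scores = np.array([4, 4, 3, 5, 4, 5, 4, 3])  # ChatGPT

# Concordance rate: proportion of abstracts where both reviewers
# assigned the identical score on this criterion.
concordance = np.mean(pt_scores == gpt_scores)

# Weighted kappa: chance-corrected agreement that credits near-misses.
kappa = cohen_kappa_score(pt_scores, gpt_scores, weights="linear")

print(f"concordance rate: {concordance:.1%}, weighted kappa: {kappa:.2f}")
```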

Results:

A total of 66 peer review results from 33 case report abstracts were analyzed. The average word count of the abstracts was 1,390.5 (SD: 127.9). The reviewers consisted of 25 physical therapists (21 males and 4 females) with an average of 9.6 years of experience (SD: 3.8). ChatGPT's average review time per abstract was 21.1 seconds (SD: 5.0). Concordance rates for the criteria ranged from 22.7% to 39.4%, with the highest values for contribution, analysis methods, and progress and results. The weighted kappa coefficients ranged from -0.10 to 0.17, with the higher values for analysis methods, intervention methods, and progress and results. The total score averaged 32.9 (SD: 6.2) for the physical therapists and 37.2 (SD: 3.0) for ChatGPT. Bland-Altman analysis showed a systematic bias, with ChatGPT scoring 4.3 points higher on average (95% confidence interval: 2.5 to 6.0). A proportional bias was also observed: the score difference widened as the average score increased, as shown by a correlation coefficient of 0.62 (p < 0.001; 95% confidence interval: 0.48 to 0.73).
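
For reference, a minimal Python sketch of the Bland-Altman calculations reported above, run on synthetic totals drawn to match the reported means and SDs (the actual study data are not reproduced):

```python
# Bland-Altman sketch on synthetic total scores; illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pt_total = rng.normal(32.9, 6.2, 33)   # physical therapist totals (synthetic)
gpt_total = rng.normal(37.2, 3.0, 33)  # ChatGPT totals (synthetic)

diff = gpt_total - pt_total            # per-abstract score difference
mean = (gpt_total + pt_total) / 2      # per-abstract average score

bias = diff.mean()                     # systematic bias (study reported 4.3)
half_loa = 1.96 * diff.std(ddof=1)     # half-width of 95% limits of agreement

# Proportional bias: does the difference grow with the average score?
r, p = stats.pearsonr(mean, diff)      # study reported r = 0.62, p < 0.001

print(f"bias = {bias:.1f}, limits of agreement: "
      f"{bias - half_loa:.1f} to {bias + half_loa:.1f}")
print(f"proportional bias: r = {r:.2f}, p = {p:.3g}")
```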

Conclusion(s):

The agreement between physical therapists and ChatGPT in the peer review of case reports was poor to slight. ChatGPT tended to score higher than the physical therapists, and the difference in total scores widened as the average score increased. When using ChatGPT for peer review, collaboration with physical therapists is essential. Further studies should refine ChatGPT's scope of application and the review prompts.

Implications:

While ChatGPT holds promise for improving the efficiency of peer review processes, its scope of application must be clearly defined so that the professional perspectives of physical therapists are incorporated.

Funding acknowledgements:
No funding was received to support this study.
Keywords:
ChatGPT
peer review
agreement
Primary topic:
Education: clinical
Second topic:
Education: continuing professional development
Third topic:
Education: methods of teaching and learning
Did this work require ethics approval?:
No
Has any of this material been/due to be published or presented at another national or international conference prior to the World Physiotherapy Congress 2025?:
No
