To compare the agreement, speed and reliability between manual screening and a supervised machine-learning approach for identifying relevant physiotherapy clinical trials.
We used two different approaches to perform title/abstract screening on 525 newly published articles yielded by a targeted search of ten databases over a three-day period. The first was the traditional human approach, in which two people independently screened titles and abstracts to determine eligibility based on: i) comparison of at least two interventions; ii) relevance of participants and at least one intervention to physiotherapy practice; and iii) random or intended-to-be-random allocation. A third person resolved disagreements. The second was the machine-learning approach. We used the Document Classification and Topic Extraction Resource (DoCTER),1 an online supervised machine-learning platform, into which we imported previously manually indexed articles as training data so the model could predict whether each record should be included. We used Cohen's kappa to assess the agreement between the human and machine-learning approaches. We varied the size of the training dataset (n=25, 50, 100, 150, 200, 250, and 300 records) and assessed whether this affected the model's agreement with the human reviewers. Precision (the proportion of predicted inclusions that were truly eligible), recall (the proportion of truly eligible records that were identified), F1 score (the harmonic mean of precision and recall) and time spent were compared between the human and machine-learning approaches.
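For readers unfamiliar with these metrics, the following is a minimal sketch of the evaluation design, assuming a generic text classifier. DoCTER itself is an online platform, so the TF-IDF plus linear SVM pipeline, function names and variables below are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch: train on n labelled records, then score agreement
# with the human screening decisions. All names are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import cohen_kappa_score, precision_score, recall_score, f1_score

def evaluate(train_texts, train_labels, test_texts, human_labels):
    """Fit a classifier on a labelled subset and compare its predictions
    with the human reviewers' inclusion decisions."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    predictions = LinearSVC().fit(X_train, train_labels).predict(X_test)
    return {
        "kappa": cohen_kappa_score(human_labels, predictions),    # chance-corrected agreement
        "precision": precision_score(human_labels, predictions),  # predicted inclusions that are eligible
        "recall": recall_score(human_labels, predictions),        # eligible records actually found
        "f1": f1_score(human_labels, predictions),                # harmonic mean of precision and recall
    }

def run_experiment(indexed_texts, indexed_labels, new_texts, human_labels):
    # Mirror the study's design: grow the training set and track agreement
    # against the human screening of the newly published records.
    for n in (25, 50, 100, 150, 200, 250, 300):
        print(n, evaluate(indexed_texts[:n], indexed_labels[:n], new_texts, human_labels))
```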
The agreement between the human and machine-learning approaches varied from fair to moderate. When comparing human screening with predictions from a model trained on 25 records, Cohen's kappa was 0.32 (z=8.4, p<0.0001), indicating fair agreement. The level of agreement varied with training dataset size, with Cohen's kappa values of 0.25, 0.43, 0.33, 0.28, 0.40, and 0.49 when the training dataset included 50, 100, 150, 200, 250, and 300 records, respectively. Among all the training datasets, the 300-record dataset had the highest precision (0.51) and F1 score (0.64). The 150- and 200-record training datasets had the best recall (0.94). Manual screening required 12 hours, whereas preparing 150 or 300 records as a training dataset took approximately 29% and 57% of that time, respectively.
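As a quick arithmetic check (a reader's derivation from the rounded figures above, not values reported by the study): 29% and 57% of the 12-hour manual effort correspond to roughly 3.5 and 6.8 hours of training-data preparation, and rearranging the F1 definition gives the recall implied for the 300-record model.

```latex
% 0.29 x 12 h ~ 3.5 h (150 records); 0.57 x 12 h ~ 6.8 h (300 records).
% Implied recall for the 300-record model (approximate, since the
% published precision and F1 are rounded):
\[
F_1 = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F_1 P}{2P - F_1} = \frac{0.64 \times 0.51}{2(0.51) - 0.64} \approx 0.86
\]
```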
There was fair to moderate agreement between human reviewers and a machine-learning model screening records to identify clinical trials relevant to physiotherapy. The machine-learning approach demonstrated some capability to identify relevant records, and its high recall indicates that it can efficiently reduce the overall number of records requiring manual screening.
Incorporating a machine-learning approach could substantially reduce the workload associated with screening potentially eligible physiotherapy trials for systematic reviews and evidence resources such as PEDro. While the model demonstrated potential for improving screening efficiency, the trade-off between filtering out irrelevant articles and missing relevant ones must be carefully considered when conducting systematic reviews.
automation tools