TY - JOUR
T1 - An open competition involving thousands of competitors failed to construct useful abstract classifiers for new diagnostic test accuracy systematic reviews
AU - Kataoka, Yuki
AU - Taito, Shunsuke
AU - Yamamoto, Norio
AU - So, Ryuhei
AU - Tsutsumi, Yusuke
AU - Anan, Keisuke
AU - Banno, Masahiro
AU - Tsujimoto, Yasushi
AU - Wada, Yoshitaka
AU - Sagami, Shintaro
AU - Tsujimoto, Hiraku
AU - Nihashi, Takashi
AU - Takeuchi, Motoki
AU - Terasawa, Teruhiko
AU - Iguchi, Masahiro
AU - Kumasawa, Junji
AU - Ichikawa, Takumi
AU - Furukawa, Ryuki
AU - Yamabe, Jun
AU - Furukawa, Toshi A.
N1 - Funding Information:
We would like to thank the EPPI for access to the COVID-19 ‘living’ systematic map of research. We acknowledge Mr. Yohei Ikenoue, Dr. Naoki Nishiyama, Mr. Yukihiro Toma, and Dr. Shigeru Saito of SIGNATE Corporation, who provided the platform and advised on setting the rules for the competition. We acknowledge Dr. Munehiko Sasajima and Dr. Naoki Katoh at the University of Hyogo for their assistance with public relations. We also acknowledge Dr. Aki Mishima for retrieving the data.
Publisher Copyright:
© 2023 John Wiley & Sons, Ltd.
PY - 2023/9
Y1 - 2023/9
AB - There are currently no abstract classifiers that can be used in new diagnostic test accuracy (DTA) systematic reviews to select primary DTA study abstracts from database searches. Our goal was to develop machine-learning-based abstract classifiers for new DTA systematic reviews through an open competition. We prepared a dataset of abstracts obtained through database searches from 11 reviews in different clinical areas. As the reference standard, we used the lists of abstracts that required manual full-text review. We randomly split the dataset into a training set, a public test set, and a private test set. Competition participants used the training set to develop classifiers, validated them using the public test set, and refined them based on their performance on the public test set. Participants could submit as many times as they wanted during the competition. Finally, we used the private test set to rank the submitted classifiers. To reduce false exclusions, we evaluated classifiers with the Fbeta measure, with beta set to 7. After the competition, we conducted an external validation using a dataset from a cardiology DTA review. We received 13,774 submissions from 1429 teams or individuals over 4 months. The top-ranked classifier achieved an Fbeta score of 0.4036 and a recall of 0.2352 in the external validation. In conclusion, we were unable to develop an abstract classifier with sufficient recall for immediate application to new DTA systematic reviews. Further studies are needed to update and validate classifiers with datasets from other clinical areas.
UR - http://www.scopus.com/inward/record.url?scp=85162962576&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85162962576&partnerID=8YFLogxK
U2 - 10.1002/jrsm.1649
DO - 10.1002/jrsm.1649
M3 - Article
AN - SCOPUS:85162962576
SN - 1759-2879
VL - 14
SP - 707
EP - 717
JO - Research Synthesis Methods
JF - Research Synthesis Methods
IS - 5
ER -