TY - GEN
T1 - A Method to Generate a Machine-Labeled Data for Biomedical Named Entity Recognition with Various Sub-Domains
AU - Kim, Juae
AU - Kwon, Sunjae
AU - Ko, Youngjoong
AU - Seo, Jungyun
N1 - Publisher Copyright:
© 2017 AFNLP
PY - 2017
Y1 - 2017
AB - Biomedical named entity (NE) recognition is a core technique for many tasks in the biomedical domain. In previous studies, machine-learning algorithms have outperformed dictionary-based and rule-based approaches because biomedical NEs have many terminological variations and new biomedical NEs are constantly being created. Achieving high performance with a machine-learning algorithm requires good-quality corpora. However, such corpora are difficult to obtain because annotating a biomedical corpus for machine learning is extremely time-consuming and costly. In addition, most existing corpora are insufficient for high-level tasks because they do not cover various domains. We therefore propose a method for generating a large amount of machine-labeled data that covers various domains. We first generate initial machine-labeled data using a chunker and MetaMap. The chunker, developed with manually annotated data, extracts only biomedical NEs, and MetaMap annotates the category of each biomedical NE. We then apply self-training to bootstrap the performance of the initial machine-labeled data. In our experiments, a biomedical NE recognition system trained with the proposed machine-labeled data achieves high performance. As a result, our system outperforms a biomedical NE recognition system that uses MetaMap alone by 26.03 percentage points in F1-score.
UR - https://www.scopus.com/pages/publications/85062560224
M3 - Conference contribution
AN - SCOPUS:85062560224
T3 - DDDSM 2017 - 1st International Workshop on Digital Disease Detection using Social Media, Proceedings of the Workshop
SP - 47
EP - 51
BT - DDDSM 2017 - 1st International Workshop on Digital Disease Detection using Social Media, Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 1st International Workshop on Digital Disease Detection using Social Media, DDDSM 2017, co-located with the 8th International Joint Conference on Natural Language Processing, IJCNLP 2017
Y2 - 27 November 2017
ER -