TY - GEN
T1 - An embedding method for unseen words considering contextual information and morphological information
AU - Won, Min Sub
AU - Choi, Yun Seok
AU - Kim, Samuel
AU - Na, Cheol Won
AU - Lee, Jee Hyong
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/3/22
Y1 - 2021/3/22
N2 - The performance of natural language processing has been greatly improved by pre-trained language models, which are trained on large corpora. However, this performance can be degraded by the OOV (Out of Vocabulary) problem. Recent language representation models such as BERT use sub-word tokenization, which splits words into pieces, to deal with the OOV problem. However, since OOV words are also divided into pieces of tokens and thus represented as a weighted sum of these unusual sub-word tokens, this can lead to misrepresentation of the OOV words. To alleviate the misrepresentation problem of OOV words, we propose a character-level pre-trained language model called CCTE (Context Char Transformer Encoder). Unlike BERT, CCTE takes the entire word as input and represents the word by considering both morphological and contextual information. Experiments on multiple datasets showed that, on NER and POS tagging tasks, the proposed model, which is smaller than existing pre-trained models, generally outperformed them. In particular, when more OOV words were present, the proposed method showed superior performance by a large margin. In addition, cosine similarity comparisons of word pairs showed that the proposed method properly captures the morphological and contextual information of words.
AB - The performance of natural language processing has been greatly improved by pre-trained language models, which are trained on large corpora. However, this performance can be degraded by the OOV (Out of Vocabulary) problem. Recent language representation models such as BERT use sub-word tokenization, which splits words into pieces, to deal with the OOV problem. However, since OOV words are also divided into pieces of tokens and thus represented as a weighted sum of these unusual sub-word tokens, this can lead to misrepresentation of the OOV words. To alleviate the misrepresentation problem of OOV words, we propose a character-level pre-trained language model called CCTE (Context Char Transformer Encoder). Unlike BERT, CCTE takes the entire word as input and represents the word by considering both morphological and contextual information. Experiments on multiple datasets showed that, on NER and POS tagging tasks, the proposed model, which is smaller than existing pre-trained models, generally outperformed them. In particular, when more OOV words were present, the proposed method showed superior performance by a large margin. In addition, cosine similarity comparisons of word pairs showed that the proposed method properly captures the morphological and contextual information of words.
KW - character-level CNN
KW - natural language processing
KW - out-of-vocabulary (OOV)
KW - transformer
KW - word embedding
UR - https://www.scopus.com/pages/publications/85105022325
U2 - 10.1145/3412841.3441982
DO - 10.1145/3412841.3441982
M3 - Conference contribution
AN - SCOPUS:85105022325
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 1055
EP - 1062
BT - Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
PB - Association for Computing Machinery
T2 - 36th Annual ACM Symposium on Applied Computing, SAC 2021
Y2 - 22 March 2021 through 26 March 2021
ER -