An embedding method for unseen words considering contextual information and morphological information

Min Sub Won, Yun Seok Choi, Samuel Kim, Cheol Won Na, Jee Hyong Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

8 Scopus citations

Abstract

The performance of natural language processing has been greatly improved by pre-trained language models, which are trained on large corpora. However, performance can be degraded by the out-of-vocabulary (OOV) problem. Recent language representation models such as BERT use sub-word tokenization, which splits a word into pieces, to deal with the OOV problem. However, since OOV words are also divided into pieces of tokens and thus represented as a weighted sum of unusual words, this can lead to misrepresentation of the OOV words. To alleviate this misrepresentation problem, we propose a character-level pre-trained language model called CCTE (Context Char Transformer Encoder). Unlike BERT, CCTE takes the entire word as input and represents it by considering both morphological and contextual information. Experiments on multiple datasets showed that, in NER and POS tagging tasks, the proposed model, which is smaller than existing pre-trained models, generally outperformed them. In particular, when there are more OOV words, the proposed method outperformed them by a large margin. In addition, cosine similarity comparisons of word pairs showed that the proposed method properly considers the morphological and contextual information of words.
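The sub-word splitting that the abstract attributes to models such as BERT can be illustrated with a minimal sketch of greedy longest-match-first tokenization, in the style of WordPiece. This is not the authors' CCTE model and not BERT's actual tokenizer or vocabulary: the function name `wordpiece_tokenize` and the tiny `TOY_VOCAB` are hypothetical, chosen only to show how a word missing from the vocabulary ends up represented by several pieces.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word split (WordPiece-style sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real model's vocabulary has tens of
# thousands of entries learned from a corpus.
TOY_VOCAB = {"un", "seen", "friend", "##seen", "##friend", "##ly", "##able"}

print(wordpiece_tokenize("unseen", TOY_VOCAB))      # ['un', '##seen']
print(wordpiece_tokenize("unfriendly", TOY_VOCAB))  # ['un', '##friend', '##ly']
print(wordpiece_tokenize("xyz", TOY_VOCAB))         # ['[UNK]']
```

In this sketch neither "unseen" nor "unfriendly" is a vocabulary entry, so each is represented only through its pieces. A model that takes the entire word as input, as the abstract describes for CCTE, avoids this splitting step.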

Original language: English
Title of host publication: Proceedings of the 36th Annual ACM Symposium on Applied Computing, SAC 2021
Publisher: Association for Computing Machinery
Pages: 1055-1062
Number of pages: 8
ISBN (Electronic): 9781450381048
State: Published - 22 Mar 2021
Event: 36th Annual ACM Symposium on Applied Computing, SAC 2021 - Virtual, Online, Korea, Republic of
Duration: 22 Mar 2021 to 26 Mar 2021

Publication series

Name: Proceedings of the ACM Symposium on Applied Computing

Conference

Conference: 36th Annual ACM Symposium on Applied Computing, SAC 2021
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 22/03/21 to 26/03/21

Keywords

  • character-level CNN
  • natural language processing
  • out-of-vocabulary (OOV)
  • transformer
  • word embedding
