Keyword extraction for blogs based on content richness

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

This paper proposes a method for extracting the topic keywords of a blog based on the richness of its content. If a blog contains rich content related to a topic word, that word can be considered a keyword of the blog. To this end, a new measure, richness, is proposed, which indicates how much of a keyword's trendy subtopics a blog covers. To obtain the trendy subtopics of a keyword, we use the web as an external source of topical context. Because the web contains diverse and current information, it can be used to find popular and trendy content related to a topic. For each candidate keyword, a set of web documents is retrieved through Google, and the subtopics found in those documents are modelled with a probabilistic approach. Based on these subtopic models, the proposed method evaluates the richness of a blog for each candidate keyword, that is, how much of the keyword's trendy subtopics the blog covers. A word for which the blog offers varied content is then chosen as one of the blog's keywords. In the experiments, the proposed method is compared with several other methods and shows better results in terms of hit count, trendiness and consistency.
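The abstract does not state the richness formula, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that the subtopics of a candidate keyword have already been modelled (for example with LDA, which appears in the keyword list) as word-probability dictionaries, and that each subtopic carries a trendiness weight. The function names `subtopic_coverage` and `richness`, the weighted-average scoring rule, and the toy data are all hypothetical.

```python
def subtopic_coverage(blog_tokens, subtopic):
    """Share of a subtopic's probability mass whose words occur in the blog.

    `subtopic` maps word -> probability (assumed to sum to 1).
    """
    present = set(blog_tokens)
    return sum(prob for word, prob in subtopic.items() if word in present)


def richness(blog_tokens, subtopics, weights):
    """Trendiness-weighted average of subtopic coverage (assumed scoring rule).

    `subtopics` is a list of word -> probability dicts for one candidate
    keyword; `weights` gives the assumed trendiness of each subtopic.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    covered = sum(
        weight * subtopic_coverage(blog_tokens, subtopic)
        for subtopic, weight in zip(subtopics, weights)
    )
    return covered / total


# Toy subtopic models for the candidate keyword "python" (made-up numbers).
python_subtopics = [
    {"pandas": 0.5, "numpy": 0.3, "dataframe": 0.2},  # data analysis
    {"django": 0.6, "flask": 0.4},                    # web frameworks
]
python_weights = [0.6, 0.4]

rich_blog = ["pandas", "numpy", "django", "tutorial"]
thin_blog = ["python", "snake", "zoo"]

rich_score = richness(rich_blog, python_subtopics, python_weights)
thin_score = richness(thin_blog, python_subtopics, python_weights)
print(rich_score > thin_score)
```

Under this assumed rule, a blog that touches several trendy subtopics of "python" scores higher than one that merely mentions the word, which is the behaviour the abstract describes for keyword selection.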

Original language: English
Pages (from-to): 38-49
Number of pages: 12
Journal: Journal of Information Science
Volume: 40
Issue number: 1
State: Published - Feb 2014

Keywords

  • Blogs
  • information retrieval
  • keyword extraction
  • LDA
  • subtopic model
  • text mining
