Topic Models

  • Topic modeling refers to a family of techniques for extracting hidden thematic structures ("topics") from a collection of documents. In many applications we want to automatically extract the central ideas expressed in a document or a passage.

  • The earliest such model was pLSI (probabilistic latent semantic indexing); the later LDA (Latent Dirichlet Allocation) model and its extensions have since become the most widely used. The LDA topic model involves somewhat deeper mathematics, including the Dirichlet distribution, the multinomial distribution, the EM algorithm, and Gibbs sampling. LDA is an unsupervised machine learning technique that has been widely applied to identify the latent topic information in large document collections or corpora.

  • Topic models typically represent a text as a bag of words (Harris, 1954): the whole text is treated as a collection of words, and grammar and word order are ignored.

LDA

(Blei et al. 2003)

LDA is a Bayesian probabilistic model. Both pLSI and LDA follow the general form:

$$P(w \mid d) = \sum_{z} p(w \mid z)\, p(z \mid d)$$

The model assumes that:

  • There are K topics.

  • Each document is represented as a probability distribution over latent topics, and each topic is in turn characterized by a probability distribution over words.

There is a random variable that assigns each topic an associated probability distribution over words. You should think of this distribution as the probability of seeing word w given topic k. There is another random variable that assigns each document a probability distribution over topics. You should think of this distribution as the mixture of topics in document d. Each word in a document was generated by first randomly picking a topic (from the document's distribution of topics) and then randomly picking a word (from the topic's distribution of words).
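The following is a minimal sketch of this generative story in R, using a toy vocabulary, K = 2 topics, and a small Dirichlet-sampling helper; all of the names and numbers here are made up for illustration and are not part of the example further below.

# toy sketch of LDA's generative process; rdirichlet() is a small helper built on rgamma()
rdirichlet <- function(alpha) { x <- rgamma(length(alpha), alpha); x / sum(x) }

vocab <- c("game", "score", "team", "market", "stock", "price")
K     <- 2

# each topic is a distribution over words (drawn here from a Dirichlet prior)
beta <- t(replicate(K, rdirichlet(rep(0.5, length(vocab)))))

# each document is a distribution over topics; every word is generated by
# first picking a topic, then picking a word from that topic's distribution
generate_doc <- function(n_words, alpha = rep(0.1, K)) {
  theta <- rdirichlet(alpha)                                 # document's topic mixture
  z     <- sample(K, n_words, replace = TRUE, prob = theta)  # a topic for each word
  sapply(z, function(k) sample(vocab, 1, prob = beta[k, ]))
}

generate_doc(10)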

Because the components of a Dirichlet-distributed random vector are only weakly correlated (they are slightly correlated only because the components must sum to 1), the latent topics we posit end up being nearly uncorrelated as well. This does not match many real-world problems and remains one of LDA's open issues.
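A quick numeric check of this claim, as a sketch: it assumes the MCMCpack package for its rdirichlet() sampler (gtools::rdirichlet() would work equally well), and the values printed will vary slightly from run to run.

library(MCMCpack)                      # for rdirichlet(n, alpha)
K <- 20
draws <- rdirichlet(10000, rep(1, K))  # 10,000 draws of a 20-component topic-proportion vector
# off-diagonal correlations all sit near -1/(K - 1), i.e. about -0.05
round(range(cor(draws)[upper.tri(matrix(TRUE, K, K))]), 3)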

Topic modeling with R

  • mallet: An R wrapper around the MALLET Java toolkit for LDA topic modeling.

  • topicmodels: An interface to the C code developed by David M. Blei for Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM).

  • lda: Implements LDA and related models using collapsed Gibbs sampling.

  • LDAvis: Interactive visualization of topic models (see the sketch after the LDA example below).

  • RTextTools: Provides the NYTimes sample dataset and the create_matrix() helper used in the example below.

library(RTextTools)
library(topicmodels)

# loading the data (the bundled NYTimes dataset contains headlines from front-page NYTimes articles)
# randomly sample 1,000 articles
data(NYTimes)
data <- NYTimes[sample(1:3100, size=1000, replace=FALSE),]

# Create a DocumentTermMatrix
# the DocumentTermMatrix built by create_matrix() can be used as input to topicmodels' LDA()
matrix <- create_matrix(cbind(as.vector(data$Title),
            as.vector(data$Subject)), 
            language="english", 
            removeNumbers=TRUE, 
            stemWords=TRUE, 
            weighting=weightTf)

# Perform Latent Dirichlet Allocation


## First we want to determine the number of topics in our data. 
# In the case of the NYTimes dataset, the data have already been classified as a training set for supervised learning algorithms. 
# Therefore, we can use the unique() function to determine the number of unique topic categories (k) in our data.
# Next, we use our matrix and this k value to generate the LDA model.

k <- length(unique(data$Topic.Code))
lda <- LDA(matrix, k)

# View the Results
## view the results by most likely term per topic, or most likely topic per document.

terms(lda)
topics(lda)
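
To see more than one term per topic, terms() takes a second argument; topics() works the same way, and posterior() returns the full topic-word and document-topic probability matrices.

terms(lda, 5)            # top 5 terms for each topic
topics(lda, 2)           # two most likely topics for each document
post <- posterior(lda)   # $terms: topic-word probabilities, $topics: document-topic probabilities

Building on that, here is a hedged sketch of how the fitted model could be passed to LDAvis. It is not part of the original example; it assumes the LDAvis and slam packages are installed and that no empty documents were dropped when fitting the model.

library(LDAvis)
library(slam)

json <- createJSON(phi   = post$terms,               # topic-word distributions
                   theta = post$topics,              # document-topic distributions
                   doc.length     = row_sums(matrix),
                   vocab          = colnames(post$terms),
                   term.frequency = col_sums(matrix))
serVis(json)   # opens the interactive visualization in a browser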

Deep learning

(2015). Topic2Vec: Learning Distributed Representations of Topics

(2015). Topical Word Embeddings.

  • The probability distributions produced by LDA capture the statistical relationships of word occurrences rather than the real semantic information embedded in words, topics, and documents.

  • LDA also assigns high probabilities to high-frequency words, so low-probability words are rarely chosen as representatives of topics. In practice, however, low-probability words sometimes distinguish topics better. For example, LDA will assign higher probability to, and choose as representatives, “food” rather than “cheeseburger”, “drug” rather than “aricept”, and “technology” rather than “smartphone”.

  • Recently, embedding-based representations have proven more effective than LDA-style representations in many tasks.

The idea is to embed "topics" into a semantic vector space.
