python - Doc2Vec Sentence Clustering

Tags: python scikit-learn text-mining gensim doc2vec

I have multiple documents that contain multiple sentences. I want to use doc2vec to cluster the sentence vectors (e.g. with k-means) using sklearn.

The idea is that similar sentences end up grouped together in several clusters. However, it is not clear to me whether I have to train on every single document separately and then run a clustering algorithm on the sentence vectors, or whether I could infer a sentence vector from doc2vec without training on every new sentence.
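To the second point: gensim's Doc2Vec does provide infer_vector(), which produces a vector for an unseen, tokenized sentence from an already-trained model, with no retraining. A minimal sketch (model is assumed to be the trained Doc2Vec model from the snippet below; the input sentence is a hypothetical placeholder):

# infer_vector() builds a vector for new text against the frozen model;
# "model" refers to the trained Doc2Vec model from the code below.
new_sentence = "some new sentence to embed"   # hypothetical input
new_vector = model.infer_vector(new_sentence.split())
print(new_vector.shape)   # (300,) -- matches size=300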

Right now this is a snippet of my code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans
import pandas as pd

# Tag each sentence so that Doc2Vec learns one vector per sentence
sentenceLabeled = []
for sentenceID, sentence in enumerate(example_sentences):
    sentenceL = TaggedDocument(words=sentence.split(), tags=['SENT_%s' % sentenceID])
    sentenceLabeled.append(sentenceL)

model = Doc2Vec(size=300, window=10, min_count=0, workers=11,
                alpha=0.025, min_alpha=0.025)
model.build_vocab(sentenceLabeled)
for epoch in range(20):
    model.train(sentenceLabeled)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
textVect = model.docvecs.doctag_syn0  # matrix of learned sentence vectors
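One caveat: the snippet above uses the older gensim API. In gensim 4.x, size was renamed vector_size, train() requires explicit total_examples and epochs arguments, and docvecs.doctag_syn0 was replaced by model.dv.vectors. A sketch of equivalent training under gensim 4, leaving the learning-rate decay to gensim instead of the manual loop:

# gensim 4.x equivalent (sketch; same variable names as above)
model = Doc2Vec(vector_size=300, window=10, min_count=0, workers=11,
                alpha=0.025, min_alpha=0.0001, epochs=20)
model.build_vocab(sentenceLabeled)
model.train(sentenceLabeled, total_examples=model.corpus_count,
            epochs=model.epochs)
textVect = model.dv.vectors   # replaces docvecs.doctag_syn0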

## K-means ##
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(textVect)
clusters = km.labels_.tolist()

## Print Sentence Clusters ##
cluster_info = {'sentence': example_sentences, 'cluster': clusters}
sentenceDF = pd.DataFrame(cluster_info, index=clusters, columns=['sentence', 'cluster'])

for num in range(num_clusters):
    print()
    print("Sentence cluster %d:" % (num + 1))
    for sentence in sentenceDF.loc[num]['sentence'].values.tolist():
        print(' %s ' % sentence)
    print()

Basically, what I am doing right now is training on every labeled sentence in the document. However, I have the feeling that this could be done in a simpler way.

Eventually, sentences that contain similar words should be clustered together and printed. At this point, training every document separately does not clearly reveal any logic within the clusters.

Hopefully someone can steer me in the right direction. Thanks.

Answer
  • Have you looked at the word vectors you get (using the dm=1 algorithm setting)? Do these show good similarities when you inspect them?
  • I would try using t-SNE to reduce the dimensionality once you have some reasonable-looking similar word vectors. You can use PCA first to reduce to, say, 50 dimensions if you need to; both are in sklearn. Then see whether your documents form distinct groups that way.
  • Also look at your most_similar() document vectors, and try infer_vector() on a known trained sentence: you should get a similarity very close to 1 if all is well. (infer_vector() gives a slightly different result each time, so it will never be identical!) A rough sketch of all three checks follows this list.
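Putting those three checks together, a rough sketch, assuming the model, textVect, clusters, and example_sentences from the question ('example_word' and the 50-dimension PCA cutoff are arbitrary placeholders; t-SNE also needs more samples than its perplexity, 30 by default):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 1. Inspect word similarities (dm=1 trains word vectors alongside doc vectors)
print(model.wv.most_similar('example_word'))   # hypothetical vocabulary word

# 2. PCA down to ~50 dimensions, then t-SNE to 2D, colored by k-means cluster
reduced = PCA(n_components=50).fit_transform(textVect)
embedded = TSNE(n_components=2).fit_transform(reduced)
plt.scatter(embedded[:, 0], embedded[:, 1], c=clusters)
plt.show()

# 3. Sanity check: re-infer a trained sentence and look for its own tag
inferred = model.infer_vector(example_sentences[0].split())
print(model.docvecs.most_similar([inferred], topn=3))
# 'SENT_0' should come back with similarity close to 1 (never exactly 1,
# since infer_vector() is stochastic)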
