Computing the TF-IDF of a query string against a trained document set - python

I have code that computes the TF-IDF matrix for 150 documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

all_lines = []
with open("Extracted Functional Goals - Stemmed.txt") as f:
    for line in f:
        # Drop the first whitespace-delimited token and keep the rest of the line.
        temp = line.split(None, 1)
        all_lines.append(temp[1])

# Strip the last two characters from every line except the final one.
all_lines_corrected = [line[:-2] for line in all_lines[:-1]]
all_lines_corrected.append(all_lines[-1])

stop_words = stopwords.words('english')
tf = TfidfVectorizer(analyzer='word', stop_words=stop_words)
tfidf_matrix = tf.fit_transform(all_lines_corrected).todense()
query_string = input("Enter string : ")  # use raw_input on Python 2

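How can I get the TF-IDF of the query string? (Can we treat it as if it were one more input alongside the 150 documents the vectorizer was trained on?)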

Answer

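You can get the tf-idf values of the query string with values = tf.transform([query_string]). The result is a sparse matrix with 1 row and N columns, where the columns hold the tfidf values for the N unique words the vectorizer saw in the training documents. Words in the query that do not occur in the training documents are ignored.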

A short example, similar to your code:

from sklearn.feature_extraction.text import TfidfVectorizer
all_lines = ["This is an example doc", "Another short example document .", "Just a third example"]

# Learn the vocabulary and idf weights from the training documents.
tf = TfidfVectorizer(analyzer='word')
tfidf_matrix = tf.fit_transform(all_lines)
query_string = "This is a short example string"
# transform() reuses the fitted vocabulary; it does not refit.
print("Query String:")
print(tf.transform([query_string]))
print("Example doc:")
print(tf.transform(["This is a short example doc"]))

Output:

Query String:
  (0, 9)        0.546454011634
  (0, 7)        0.546454011634
  (0, 5)        0.546454011634
  (0, 4)        0.32274454218
Example doc:
  (0, 9)        0.479527938029
  (0, 7)        0.479527938029
  (0, 5)        0.479527938029
  (0, 4)        0.283216924987
  (0, 2)        0.479527938029
