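I have a Pandas DataFrame with a text column. I want to count the most frequently used phrases in this column.
For example, in the text below you can see that phrases such as "a very good movie" and "last night" occur many times.
I think there is a way to do this by defining n-grams, e.g. phrases of between 3 and 5 words, but I don't know how to do it.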
import pandas as pd
text = ['this is a very good movie that we watched last night',
'i have watched a very good movie last night',
'i love this song, its amazing',
'what should we do if he asks for it',
'movie last night was amazing',
'a very nice song was played',
'i would like to se a good show',
'a good show was on tv last night']
df = pd.DataFrame({"text":text})
print(df)
So my goal is to rank the phrases (3 to 5 words long) by how often they occur.
Answer
First flatten the split text into vals with a list comprehension, then create the ngrams, pass them to Series, and finally use Series.value_counts:
from nltk import ngrams
vals = [y for x in df['text'] for y in x.split()]
n = [3,4,5]
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
(a, good, show) 2
(movie, last, night) 2
(a, very, good) 2
(last, night, i) 2
(a, very, good, movie) 2
..
(should, we, do) 1
(a, very, nice, song, was) 1
(asks, for, it, movie, last) 1
(this, song,, its, amazing, what) 1
(i, have, watched, a) 1
Length: 171, dtype: int64
Or, if the tuples should be joined with spaces:
n = [3,4,5]
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
last night i 2
a good show 2
a very good movie 2
very good movie 2
movie last night 2
..
its amazing what should 1
watched last night i have 1
to se a 1
very good movie last night 1
a very nice song was 1
Length: 171, dtype: int64
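Because vals is one flat list of words from every row, some n-grams above span two different rows: "last night i" is counted twice only because one row ends with "last night" and the next begins with "i". If that is not wanted, the n-grams can be built row by row instead. The following is a minimal sketch under that assumption, not part of the original answer; it uses only the standard library, and the helper name row_ngrams is hypothetical. With the DataFrame from the question, df['text'] could be passed in place of the texts list.

```python
from collections import Counter

# Same sample data as in the question.
texts = ['this is a very good movie that we watched last night',
         'i have watched a very good movie last night',
         'i love this song, its amazing',
         'what should we do if he asks for it',
         'movie last night was amazing',
         'a very nice song was played',
         'i would like to se a good show',
         'a good show was on tv last night']

def row_ngrams(sentence, sizes=(3, 4, 5)):
    """Yield space-joined n-grams that stay inside a single sentence.

    Hypothetical helper: a plain sliding window over the split words.
    """
    tokens = sentence.split()
    for n in sizes:
        # range() is empty when the sentence has fewer than n words
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

# Count every 3- to 5-word phrase, one row at a time.
per_row_counts = Counter(g for s in texts for g in row_ngrams(s))

# Show the five most frequent phrases.
for phrase, count in per_row_counts.most_common(5):
    print(count, phrase)
```

With this variant "last night i" no longer appears at all, while phrases such as "a very good movie" and "movie last night" keep their count of 2.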
Another idea, using Counter:
from nltk import ngrams
from collections import Counter
vals = [y for x in df['text'] for y in x.split()]
c = Counter([' '.join(y) for x in [3,4,5] for y in ngrams(vals, x)])
df1 = pd.DataFrame({'ngrams': list(c.keys()),
'count': list(c.values())})
print (df1)
ngrams count
0 this is a 1
1 is a very 1
2 a very good 2
3 very good movie 2
4 good movie that 1
.. ... ...
166 show a good show was 1
167 a good show was on 1
168 good show was on tv 1
169 show was on tv last 1
170 was on tv last night 1
[171 rows x 2 columns]
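The DataFrame built from the Counter is in first-seen order, not ranked. Since the goal is a ranking, Counter.most_common can sort the phrases by count, and a filter can keep only those that occur more than once. This is a small sketch rather than part of the original answer: ngrams_of is a hypothetical pure-Python stand-in for nltk.ngrams, used here so the example runs without extra packages.

```python
from collections import Counter

# Same sample data as in the question.
texts = ['this is a very good movie that we watched last night',
         'i have watched a very good movie last night',
         'i love this song, its amazing',
         'what should we do if he asks for it',
         'movie last night was amazing',
         'a very nice song was played',
         'i would like to se a good show',
         'a good show was on tv last night']

def ngrams_of(tokens, n):
    """Hypothetical stand-in for nltk.ngrams: tuples of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))

# Flatten all rows into one word list, as in the answer above.
vals = [word for sentence in texts for word in sentence.split()]

c = Counter(' '.join(gram) for n in (3, 4, 5) for gram in ngrams_of(vals, n))

# most_common() returns (phrase, count) pairs sorted by descending count;
# keep only the phrases that occur more than once.
repeated = [(phrase, count) for phrase, count in c.most_common() if count > 1]

for phrase, count in repeated:
    print(count, phrase)
```

If a DataFrame is preferred, pd.DataFrame(repeated, columns=['ngrams', 'count']) would give the same ranking in tabular form.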