使用nltk改进对人名的提取 - python

Improve this question

我正在尝试从文本中提取人名。

有人有推荐的方法吗?

这是我尝试过的(下面的代码):
我正在使用nltk查找标记为人的所有东西,然后生成该人所有NNP部分的列表。我正在跳过只有一个NNP可以避免抓住一个姓氏的人。

我得到了不错的结果,但想知道是否有更好的方法来解决这个问题。

码:

import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.node == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []

    return (person_list)

text = """
Some economists have responded positively to Bitcoin, including 
Francois R. Velde, senior economist of the Federal Reserve in Chicago 
who described it as "an elegant solution to the problem of creating a 
digital currency." In November 2013 Richard Branson announced that 
Virgin Galactic would accept Bitcoin as payment, saying that he had invested 
in Bitcoin and found it "fascinating how a whole new global currency 
has been created", encouraging others to also invest in Bitcoin.
Other economists commenting on Bitcoin have been critical. 
Economist Paul Krugman has suggested that the structure of the currency 
incentivizes hoarding and that its value derives from the expectation that 
others will accept it as payment. Economist Larry Summers has expressed 
a "wait and see" attitude when it comes to Bitcoin. Nick Colas, a market 
strategist for ConvergEx Group, has remarked on the effect of increasing 
use of Bitcoin and its restricted supply, noting, "When incremental 
adoption meets relatively fixed supply, it should be no surprise that 
prices go up. And that’s exactly what is happening to BTC prices."
"""

names = get_human_names(text)
print "LAST, FIRST"
for name in names: 
    last_first = HumanName(name).last + ', ' + HumanName(name).first
        print last_first

输出:

LAST, FIRST
Velde, Francois
Branson, Richard
Galactic, Virgin
Krugman, Paul
Summers, Larry
Colas, Nick

除了维珍银河,这都是有效的输出。当然,在本文中了解维珍银河不是人的名字是很困难的(也许是不可能的)部分。

参考方案

必须同意“使我的代码更好”这个网站不适合的建议,但是我可以给你一些方法,让你尝试挖掘

看一看Stanford Named Entity Recognizer (NER)。它的绑定已包含在NLTK v 2.0中,但是您必须下载一些核心文件。这是script,可以为您完成所有操作。

我写了这个脚本:

import nltk
from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
text = """YOUR TEXT GOES HERE"""

for sent in nltk.sent_tokenize(text):
    tokens = nltk.tokenize.word_tokenize(sent)
    tags = st.tag(tokens)
    for tag in tags:
        if tag[1]=='PERSON': print tag

并没有那么糟糕的输出:

(“Francois”,“PERSON”)
(“R。”,“PERSON”)
(“Velde”,“PERSON”)
(“Richard”,“PERSON”)
(“布兰森”,“人”)
(“处女”,“人”)
(“银河系”,“人”)
(“比特币”,“人”)
(“比特币”,“人”)
(“保罗”,“人”)
(“克鲁格曼”,“人”)
(“拉里”,“人”)
(“夏天”,“人”)
(“比特币”,“人”)
(“尼克”,“人”)
(“可乐”,“人”)

希望这会有所帮助。

Python uuid4,如何限制唯一字符的长度 - python

在Python中,我正在使用uuid4()方法创建唯一的字符集。但是我找不到将其限制为10或8个字符的方法。有什么办法吗?uuid4()ffc69c1b-9d87-4c19-8dac-c09ca857e3fc谢谢。 参考方案 尝试:x = uuid4() str(x)[:8] 输出:"ffc69c1b" Is there a way to…

Python-如何检查Redis服务器是否可用 - python

我正在开发用于访问Redis Server的Python服务(类)。我想知道如何检查Redis Server是否正在运行。而且如果某种原因我无法连接到它。这是我的代码的一部分import redis rs = redis.Redis("localhost") print rs 它打印以下内容<redis.client.Redis o…

在返回'Response'(Python)中传递多个参数 - python

我在Angular工作,正在使用Http请求和响应。是否可以在“响应”中发送多个参数。角度文件:this.http.get("api/agent/applicationaware").subscribe((data:any)... python文件:def get(request): ... return Response(seriali…

无法注释掉涉及多行字符串的代码 - python

基本上,我很好奇这为什么会引发语法错误,以及如何用Python的方式来“注释掉”我未使用的代码部分,例如在调试会话期间。''' def foo(): '''does nothing''' ''' 参考方案 您可以使用三重双引号注释掉三重单引…

Python:BeautifulSoup-根据名称属性获取属性值 - python

我想根据属性名称打印属性值,例如<META NAME="City" content="Austin"> 我想做这样的事情soup = BeautifulSoup(f) //f is some HTML containing the above meta tag for meta_tag in soup(&#…