How to group Wikipedia categories in Python?

For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their Wikipedia categories.

hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

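As you can see, the first three concepts belong to the medical domain (whereas the remaining two are not medical terms).

More precisely, I want to split my concepts into medical and non-medical. However, it is very difficult to do so using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are both medical, their categories are very different from each other.

Therefore, I would like to know whether there is a way to obtain the parent category of the categories (e.g., the categories of enzyme inhibitor and bypass surgery belong to a medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to these two libraries and am happy to use solutions based on other libraries as well.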

EDIT

As suggested by @IlmariKaronen, I also used the categories of categories, and the results I got are as follows (the small font near each category shows the categories of that category).

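However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details is another potential option. However, the Medicine WikiProject does not seem to cover all medical terms, so other WikiProjects would need to be checked as well.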

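EDIT:
My current code for extracting categories from Wikipedia concepts is as follows. This can be done with either pywikibot or pymediawiki.

Using the library pymediawiki

from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('enzyme inhibitor')
print(p.categories)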

Using the library pywikibot

import pywikibot as pw

site = pw.Site('en', 'wikipedia')

print([
    cat.title()
    for cat in pw.Page(site, 'support-vector machine').categories()
    if 'hidden' not in cat.categoryinfo
])

The categories of a category can be obtained in the same way, as shown in @IlmariKaronen's answer.
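Applying the same lookup repeatedly walks up the category graph. Below is a minimal sketch of that walk, not a tested solution. It assumes a hypothetical get_parents callable that returns the parent category titles of a category (with pywikibot it could wrap the categories() call shown above). The toy graph is made up for illustration and is not real Wikipedia data.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def ancestor_categories(
    start: Iterable[str],
    get_parents: Callable[[str], List[str]],
    max_depth: int = 3,
) -> Dict[str, int]:
    """Breadth-first walk up the category graph.

    Returns every category reached within max_depth steps, mapped to the
    depth at which it was first seen (start categories have depth 0).
    """
    depths = {category: 0 for category in start}
    queue = deque(depths)
    while queue:
        category = queue.popleft()
        depth = depths[category]
        if depth == max_depth:
            continue  # do not expand past the depth limit
        for parent in get_parents(category):
            if parent not in depths:  # also guards against cycles
                depths[parent] = depth + 1
                queue.append(parent)
    return depths


# Hypothetical toy graph standing in for real API calls.
TOY_GRAPH = {
    "Category:Enzyme inhibitors": ["Category:Pharmacology"],
    "Category:Pharmacology": ["Category:Medicine"],
    "Category:Medicine": ["Category:Pharmacology"],  # deliberate cycle
}

found = ancestor_categories(
    ["Category:Enzyme inhibitors"], lambda c: TOY_GRAPH.get(c, [])
)
print(found)
```

The category graph is not a tree and can contain cycles, so both the depth limit and the visited check matter. A concept could then be flagged as medical if a chosen root category shows up among its ancestors.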

If you are looking for a longer list of concepts for testing, here are more examples.

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

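For a very long list, please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% (an algorithm that detects many of the medical concepts is enough for me).

I am happy to provide more details if needed.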

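Answer

Solution overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label agreed on by more than 50% of the classifiers in your binary case).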

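I'm thinking about the following approaches:

Active learning (the example approach I provide below)
MediaWiki backlinks, provided as an answer by @TavoGC
SPARQL ancestral categories, provided as a comment to your question by @Stanislav Kralin, and/or parent categories, provided by @Meena Nagarajan (these two could form an ensemble of their own based on their differences, but for that you would have to contact both authors and compare their results).

This way two out of three would have to agree that a certain concept is a medical one, which further reduces the chance of an error.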
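The two-out-of-three rule can be sketched as a small helper. The function name and the example votes are hypothetical. Each vote is assumed to be the True/False output of one of the three methods for a single concept.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(votes) * 2 > len(votes)


# Hypothetical votes for one concept from:
# active learning, backlinks, ancestor categories.
print(majority_vote([True, False, True]))
```

Using an odd number of voters avoids ties. With an even number, this sketch treats a tie as non-medical.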

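While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

The distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy) because it does not take the magnitude of the vectors into account (and it shouldn't, that is how word2vec and GloVe were trained).
Many artificial clusters would be created, if I understood correctly, while we only need two: medicine and non-medicine. Furthermore, the centroid of the medicine cluster is not centered on medicine itself. This poses additional problems: if the centroid moves far away from medicine, other words such as computer or human (or any other that in your opinion does not fit into medicine) might get into the cluster.
It is hard to evaluate the results, all the more so because the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally nonsensical results [yes, I have tried it: PCA gets around 5% explained variance for your longer dataset, which is really, really low]).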
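The first point, that cosine similarity ignores vector magnitude while Euclidean distance does not, can be checked on a toy example. The vectors are made up, and the helpers are plain-Python illustrations rather than spaCy's implementation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]  # same direction, ten times longer

print(cosine_similarity(v, scaled))   # very close to 1: same direction
print(euclidean_distance(v, scaled))  # large: the lengths differ
```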

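Based on the problems highlighted above, I have come up with a solution using active learning, which is a rather forgotten approach to such problems.

Active learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (e.g., what does it mean for a term to be part of the medical category), we ask a human "expert" (who does not actually have to be an expert) to provide some answers.

Knowledge encoding

As anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).

I'm not going to repeat his points in my answer, so I will just add my two cents:

Do not use contextualized word embeddings, i.e. the currently available state of the art (e.g. BERT).
Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). This should be checked (and is checked in my code; it is discussed further below), and you may want to use the embedding that has most of them present.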
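The coverage check from the second point can be sketched independently of any embedding library. It assumes the vectors have already been looked up and stored in a plain dict. The helper name and the toy 3-dimensional vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def missing_concepts(vectors: Dict[str, Sequence[float]]) -> List[str]:
    """Return the concepts whose vector is empty or all zeros."""
    return [
        concept
        for concept, vector in vectors.items()
        if not any(component != 0.0 for component in vector)
    ]


# Made-up vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "menthol": [0.2, -0.1, 0.7],
    "vpb": [0.0, 0.0, 0.0],
    "climate": [0.5, 0.3, -0.2],
}

missing = missing_concepts(toy_vectors)
coverage = 1 - len(missing) / len(toy_vectors)
print(missing, f"coverage: {coverage:.0%}")
```

Running this for each candidate embedding gives a simple number to compare them by.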

Measuring similarity using spaCy

This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

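This code returns a number for each concept measuring how similar it is to the centroid. Furthermore, it records the indices of concepts that have no representation. It can be called like this: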

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

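You may substitute your own data for concepts_new.txt.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more information is available in the spaCy documentation.

Additionally, you may want to use multiple centroid words, e.g. add words like disease or health and average their word vectors. I'm not sure whether that would positively affect your case, though.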
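Averaging several centroid words amounts to a component-wise mean of their vectors. A minimal sketch with made-up 3-dimensional vectors standing in for medicine, disease and health follows. The mean_vector helper is hypothetical and not part of spaCy.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


# Made-up vectors standing in for "medicine", "disease" and "health".
centroid_words = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [2.0, 3.0, 2.0],
]
print(mean_vector(centroid_words))
```

Since a spaCy Doc vector is itself the mean of its word vectors, passing the centroid words to nlp as a single string may give a comparable centroid. I have not verified that this matches the explicit average in every case.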

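Another possibility is to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have several thresholds in that case; this is likely to remove some false positives, but it may miss some terms that one could consider similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory you should consider the two options above (and only then; don't jump into them without prior thought).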
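One way the multi-centroid idea could look is sketched below. The rule used here (a concept counts as medical if it clears the threshold of at least one centroid) is my assumption; requiring all centroids to agree would be stricter. The centroid vectors and thresholds are made up.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; a missing (all-zero) vector counts as dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else -1.0


def is_medical(
    concept_vector: Sequence[float],
    centroids: Dict[str, Tuple[Sequence[float], float]],
) -> bool:
    """True when the concept clears the threshold of at least one centroid."""
    return any(
        cosine(concept_vector, vector) >= threshold
        for vector, threshold in centroids.values()
    )


# Hypothetical 2-dimensional centroids, each with its own threshold.
centroids = {
    "medicine": ([1.0, 0.0], 0.8),
    "disease": ([0.0, 1.0], 0.9),
}
print(is_medical([0.9, 0.1], centroids))  # close to the "medicine" centroid
print(is_medical([0.6, 0.6], centroids))  # halfway between, clears neither
```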

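Now we have a rough measure of each concept's similarity. But what does it mean for a certain concept to have a positive similarity of 0.1 to medicine? Is it a concept one should classify as medical? Or is that already too far away?

Asking the expert

To get a threshold (terms below it will be considered non-medical), it is easiest to ask a human to classify some of the concepts for us (and that is what active learning is about). Yes, I know this is a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.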

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

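The samples argument describes how many examples will be shown to the expert during each iteration (this is a maximum; fewer are shown if some samples have already been asked about or there are not enough of them).
step is the amount by which the threshold drops in each iteration (we start at 1, meaning perfect similarity).
change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, since several of them are shown), step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between step changes.
Concepts are sorted by their similarity (the more similar a concept is, the higher it ranks).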
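To see how step and change_multiplier interact, the threshold updates can be replayed without a human in the loop. This is only a sketch. trace_threshold is a hypothetical helper and the scripted answers are made up. It copies the update rule of the _parse_expert_decision method given in this answer.

```python
from typing import Iterable


def trace_threshold(
    answers: Iterable[str], step: float = 0.05, change_multiplier: float = 0.7
) -> float:
    """Replay scripted expert answers and return the resulting threshold."""
    threshold, low, high = 1.0, -1.0, 1.0
    for answer in answers:
        if answer == "y":
            # Concepts above the threshold are medical: try a lower threshold.
            high = threshold
            if threshold - step < low:
                break
            threshold -= step
        elif answer == "n":
            # Concepts above the threshold are not medical: shrink the step.
            low = threshold
            step *= change_multiplier
            if threshold + step < high:
                break
            threshold += step
        else:
            break  # any other answer quits
    return threshold


print(round(trace_threshold(["y", "y", "n"]), 3))
```

In this trace the threshold drops twice, from 1 to 0.95 and then to about 0.9, and the first n ends the search there.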

The function below asks the expert for an opinion; the optimal threshold is then found based on the answers.

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... and parsing the expert's answer:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as the current threshold is already not related to medicine
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False

And finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

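All in all, you will have to answer some questions manually, but in my opinion this approach is far more accurate.

Furthermore, you don't have to go through all of the samples, only a small subset of them. You decide how many samples constitute a medical term (if 40 medical samples and 10 non-medical samples are shown, should it still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to be still valid.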
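If you judge each displayed concept individually, the single y/n answer for the batch can be derived from the share of medical samples. A minimal sketch follows. batch_answer is a hypothetical helper and the 0.8 cut-off is an arbitrary assumption to adjust to your own tolerance.

```python
from typing import Sequence


def batch_answer(labels: Sequence[bool], min_medical_share: float = 0.8) -> str:
    """Collapse per-concept judgements into one y/n answer for the batch."""
    if not labels:
        raise ValueError("empty batch")
    share = sum(labels) / len(labels)
    return "y" if share >= min_medical_share else "n"


# 40 medical and 10 non-medical samples, the case raised in the text.
print(batch_answer([True] * 40 + [False] * 10))
# 49 medical samples and a single outlier.
print(batch_answer([True] * 49 + [False]))
```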

Once again: this approach should be mixed with others in order to minimize the chance of wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for classification:

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

For convenience, here is the final source code in full:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

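After answering some questions, with the threshold at 0.1 (everything in [-1, 0.1) is considered non-medical, while everything in [0.1, 1] is considered medical), I got the following results: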

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

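As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, mixing my approach with the other answers would probably rule out ideas such as sport shoe belonging to medicine, and the active learning approach would act more as the deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use several of them (either increasing or decreasing), let's say 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this: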

True True False False False

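With majority voting we would mark it non-medical by 3 votes to 2. Furthermore, a threshold that is too strict would be mitigated as well if the thresholds below it out-vote it (the case where the True/False values look like this: True True True False False).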
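The vote over several thresholds can be sketched as follows. ensemble_predict is a hypothetical helper, and the similarity values 0.25 and 0.35 are made up so that they reproduce the two vote patterns above.

```python
from typing import Sequence


def ensemble_predict(similarity: float, thresholds: Sequence[float]) -> bool:
    """Each threshold casts one vote; a strict majority of True wins."""
    votes = [similarity >= threshold for threshold in thresholds]
    return sum(votes) * 2 > len(votes)


thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
print(ensemble_predict(0.25, thresholds))  # votes: True True False False False
print(ensemble_predict(0.35, thresholds))  # votes: True True True False False
```

When all thresholds are compared against the same similarity score, as in this sketch, the vote gives the same result as comparing against the median threshold (0.3 here). The ensemble probably adds the most when each threshold comes from a separate labelling run.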

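A final possible improvement I came up with: in the code above I'm using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (its vector consists of zeros); in that case the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation), and in such a case you could average only those vectors which are different from zero.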
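Averaging only the non-zero word vectors could look like the sketch below. mean_of_known_vectors is a hypothetical helper working on plain lists; with spaCy the per-word vectors would come from the tokens of the Doc.

```python
from typing import List, Sequence


def mean_of_known_vectors(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the word vectors, skipping all-zero (missing) ones."""
    known = [v for v in word_vectors if any(c != 0.0 for c in v)]
    if not known:
        return []  # no word of the concept has a representation
    return [sum(components) / len(known) for components in zip(*known)]


# Made-up concept of three words; the middle word has no representation.
concept = [[2.0, 4.0], [0.0, 0.0], [4.0, 0.0]]
print(mean_of_known_vectors(concept))
```

It is worth checking the effect on your data. Cosine similarity depends only on direction, and dropping zero vectors rescales the mean without rotating it, so the benefit may show up only with length-sensitive measures.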

I know this post is quite lengthy, so if you have any questions, post them below.
