NLTK named entity recognition to a Python list

Question

I used NLTK's ne_chunk to extract named entities from a text:

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."


nltk.ne_chunk(my_sent, binary=True)

But I can't figure out how to save these entities to a list. For example:

print Entity_list
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

26

python nlp nltk named-entity-recognition

user3590728, Aug 5 '15 at 14:58

6 Answers

Solution

You can also extract the label of each named entity in the text using this code:

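import nltk

# `sentence` is the input text, e.g. the my_sent string from the question.
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are Tree objects with a label();
        # ordinary tokens are plain (word, tag) tuples and are skipped.
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))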

Output:

GPE WASHINGTON
GPE New York
PERSON Loretta E. Lynch
GPE Brooklyn

You can see that WASHINGTON, New York and Brooklyn are labelled GPE, which stands for geopolitical entity, and Loretta E. Lynch is labelled PERSON.
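If the goal is to keep the entities rather than print them, the (label, text) pairs can be collected and then grouped by label. The following is only a sketch: the `pairs` list is typed in by hand from the output shown above, and the helper name `group_by_label` is made up for illustration. In real use, `pairs` would presumably be filled inside the loop with `pairs.append((chunk.label(), ' '.join(c[0] for c in chunk)))`.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (label, text) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Hand-copied from the printed output above (not produced by NLTK here).
pairs = [
    ("GPE", "WASHINGTON"),
    ("GPE", "New York"),
    ("PERSON", "Loretta E. Lynch"),
    ("GPE", "Brooklyn"),
]

by_label = group_by_label(pairs)
print(by_label)
```

Within each label, the entities stay in the order in which they appeared in the text.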

22

user1361125, Aug 16 '17 at 03:15

Since you get a tree as the return value, I assume you want to pick out the subtrees that are labelled NE.

Here is a simple example that collects all of them in a list:

import nltk

my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True)  # POS tagging before chunking!

named_entities = []

for t in parse_tree.subtrees():
    if t.label() == 'NE':
        named_entities.append(t)
        # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree

print(named_entities)

This gives:

[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]

or, as a list of lists:

[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

See also: How to navigate a nltk.tree.Tree?
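The question asks for plain strings such as 'New York' rather than tagged words, so the chunks still need one more step. A minimal sketch, assuming each chunk is a flat sequence of (word, tag) pairs as in the list of lists shown above; the helper name `chunks_to_strings` is hypothetical:

```python
def chunks_to_strings(chunks):
    """Join the words of each chunk of (word, tag) pairs into one string."""
    return [" ".join(word for word, _tag in chunk) for chunk in chunks]


# The list of lists shown above, typed in by hand.
chunks = [[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

entity_list = chunks_to_strings(chunks)
print(entity_list)
```

Because a flat NE subtree also iterates over its (word, tag) pairs, the same helper should work on the list of Tree objects as well.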

8

user4094444, Aug 5 '15 at 15:50

Use tree2conlltags from nltk.chunk. Note that ne_chunk needs POS-tagged input, and the POS tagger in turn works on word tokens (hence the word_tokenize call).

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
"""[('Mark', 'NNP', 'B-PERSON'), 
    ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), 
    ('are', 'VBP', 'O'), ('working', 'VBG', 'O'), 
    ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), 
    ('.', '.', 'O')] """

This gives you a list of tuples: [(token, pos_tag, name_entity_tag)]. If this list is not exactly what you want, it is certainly easier to build the list you want from it than from an NLTK tree.

The code and details are from this link; check it out for more information.

Edit: added the output as a docstring.
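To illustrate how the tuple list can be turned into entities, here is a minimal sketch that groups consecutive B-/I- tags into (text, type) pairs. The function name `entities_from_iob` is made up for this example, and `sample` is copied by hand from the output above rather than computed by NLTK.

```python
def entities_from_iob(tagged):
    """Group (token, pos, iob) triples into (entity_text, entity_type) pairs."""
    entities = []
    tokens, etype = [], None
    for token, _pos, iob in tagged:
        if iob.startswith("B-") or (iob.startswith("I-") and iob[2:] != etype):
            # A new entity starts: flush the one in progress, if any.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], iob[2:]
        elif iob.startswith("I-"):
            # Continuation of the current entity.
            tokens.append(token)
        else:
            # "O": outside any entity, so close the current one.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        # Flush an entity that ends the sentence.
        entities.append((" ".join(tokens), etype))
    return entities


sample = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'),
          ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'),
          ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
          ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

found = entities_from_iob(sample)
print(found)
# [('Mark', 'PERSON'), ('John', 'PERSON'), ('Google', 'ORGANIZATION')]
```

Multi-word entities arrive as a B- tag followed by I- tags, which the helper joins with spaces.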

6

user6623365, Feb 12 '18 at 01:30

You could also consider using spaCy:

import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut used originally no longer works in spaCy 3+

doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')

print([ent for ent in doc.ents])

>>> [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]

4

user5539715, Mar 16 '18 at 18:40

A Tree is a list. Chunks are subtrees; words outside any chunk are ordinary (word, tag) tuples. So let's walk down the list, extract the words from each chunk and join them.

>>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
>>>
>>> [" ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree)]
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']
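The difference between chunks and ordinary tokens can be seen without downloading any NLTK models. The sketch below uses `StubTree`, a hypothetical stand-in that only mimics the two properties relied on here (a Tree is a list of children and has a label); it is not NLTK's class, and the sample data is typed in by hand.

```python
class StubTree(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def chunk_texts(chunked, tree_type=StubTree):
    """Return the joined words of every top-level chunk, skipping plain tokens."""
    return [
        " ".join(word for word, _tag in elt)
        for elt in chunked
        if isinstance(elt, tree_type)  # chunks are trees, other tokens are tuples
    ]


# Hand-built fragment resembling the start of the chunked sentence.
chunked = StubTree("S", [
    StubTree("GPE", [("WASHINGTON", "NNP")]),
    ("--", ":"),
    ("In", "IN"),
    StubTree("GPE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
])

texts = chunk_texts(chunked)
print(texts)  # ['WASHINGTON', 'New York']
```

With real NLTK output, passing `tree_type=nltk.Tree` should give the same behaviour as the one-liner above.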

3

user699305, May 31 '17 at 20:45

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the named entities. You can use a list comprehension to do that.

import nltk
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

word = nltk.word_tokenize(my_sent)
pos_tag = nltk.pos_tag(word)
chunk = nltk.ne_chunk(pos_tag)
NE = [" ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE)

1

user11723562, Mar 24 '20 at 13:28

Accepted Answer · user610569, Aug 5 '15 at 16:46

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs.

Take a look at Named Entity Recognition with Regular Expression: NLTK

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> from nltk.tree import Tree
>>> 
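>>> def get_continuous_chunks(text):
...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
...     continuous_chunk = []
...     current_chunk = []
...     for i in chunked:
...         if isinstance(i, Tree):
...             # Adjacent NE subtrees are merged into one entity.
...             current_chunk.append(" ".join(token for token, pos in i.leaves()))
...         elif current_chunk:
...             named_entity = " ".join(current_chunk)
...             if named_entity not in continuous_chunk:
...                 continuous_chunk.append(named_entity)
...             # Reset even when the entity was a duplicate.
...             current_chunk = []
...     # Flush an entity that ends the text.
...     if current_chunk:
...         named_entity = " ".join(current_chunk)
...         if named_entity not in continuous_chunk:
...             continuous_chunk.append(named_entity)
...     return continuous_chunk
...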
>>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
>>> get_continuous_chunks(my_sent)
['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']