Named entity recognition and relation extraction
30 min | Last modified: December 9, 2020
Text Analytics with Python
[1]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('ieer')
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] /root/nltk_data...
[nltk_data] Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data] Package words is already up-to-date!
[nltk_data] Downloading package ieer to /root/nltk_data...
[nltk_data] Package ieer is already up-to-date!
[1]:
True
[2]:
##
## Entity recognition in a sentence
## (the treebank sample must be available, e.g. via nltk.download('treebank'))
##
sent = nltk.corpus.treebank.tagged_sents()[22]
##
## With binary=True, every entity is tagged
## simply as NE, with no category
##
print(nltk.ne_chunk(sent, binary=True))
(S
The/DT
(NE U.S./NNP)
is/VBZ
one/CD
of/IN
the/DT
few/JJ
industrialized/VBN
nations/NNS
that/WDT
*T*-7/-NONE-
does/VBZ
n't/RB
have/VB
a/DT
higher/JJR
standard/NN
of/IN
regulation/NN
for/IN
the/DT
smooth/JJ
,/,
needle-like/JJ
fibers/NNS
such/JJ
as/IN
crocidolite/NN
that/WDT
*T*-1/-NONE-
are/VBP
classified/VBN
*-5/-NONE-
as/IN
amphobiles/NNS
,/,
according/VBG
to/TO
(NE Brooke/NNP)
T./NNP
Mossman/NNP
,/,
a/DT
professor/NN
of/IN
pathlogy/NN
at/IN
the/DT
(NE University/NNP)
of/IN
(NE Vermont/NNP College/NNP)
of/IN
(NE Medicine/NNP)
./.)
[3]:
##
## With binary=False (the default), each entity is
## tagged with one of the following categories:
##
## ORGANIZATION
## PERSON
## LOCATION
## DATE
## TIME
## MONEY
## PERCENT
## FACILITY human-made artifacts in the domains of architecture and civil engineering
## GPE geo-political entities such as city, state/province, and country
##
print(nltk.ne_chunk(sent))
(S
The/DT
(GPE U.S./NNP)
is/VBZ
one/CD
of/IN
the/DT
few/JJ
industrialized/VBN
nations/NNS
that/WDT
*T*-7/-NONE-
does/VBZ
n't/RB
have/VB
a/DT
higher/JJR
standard/NN
of/IN
regulation/NN
for/IN
the/DT
smooth/JJ
,/,
needle-like/JJ
fibers/NNS
such/JJ
as/IN
crocidolite/NN
that/WDT
*T*-1/-NONE-
are/VBP
classified/VBN
*-5/-NONE-
as/IN
amphobiles/NNS
,/,
according/VBG
to/TO
(PERSON Brooke/NNP T./NNP Mossman/NNP)
,/,
a/DT
professor/NN
of/IN
pathlogy/NN
at/IN
the/DT
(ORGANIZATION University/NNP)
of/IN
(PERSON Vermont/NNP College/NNP)
of/IN
(GPE Medicine/NNP)
./.)
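The printed tree is easier to work with once the chunks are collected into (label, text) pairs. The following is a minimal sketch, not part of the original notebook: it rebuilds a hand-copied fragment of the tree above with `Tree.fromstring` instead of rerunning `nltk.ne_chunk`, so it needs no downloaded data. The names `CHUNKED` and `list_entities` are hypothetical helpers introduced for illustration.

```python
from nltk.tag import str2tuple
from nltk.tree import Tree

# Hand-copied fragment of the chunk tree printed above (not recomputed here).
CHUNKED = """
(S
  according/VBG to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ,/, a/DT professor/NN of/IN pathlogy/NN at/IN the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (PERSON Vermont/NNP College/NNP)
  of/IN
  (GPE Medicine/NNP))
"""


def list_entities(tree):
    """Return (label, text) pairs for every chunk directly under the root."""
    entities = []
    for child in tree:
        # Chunks are nested Tree objects; plain tokens are (word, tag) tuples.
        if isinstance(child, Tree):
            words = [word for word, _tag in child.leaves()]
            entities.append((child.label(), " ".join(words)))
    return entities


# read_leaf turns each "word/TAG" string into a (word, TAG) tuple.
tree = Tree.fromstring(CHUNKED, read_leaf=str2tuple)
entities = list_entities(tree)
for label, text in entities:
    print(label, text)
```

Listing the pairs makes the chunker's mistakes in this sentence easy to see: "University of Vermont College of Medicine" is split into three separate chunks, and two of them are labeled PERSON and GPE.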
[4]:
##
## Relations are extracted as triples (X, a, Y), where
## X and Y are entities and `a` is the text that relates them
##
import re
## The pattern accepts text containing the word "in", unless
## it is followed by a gerund (e.g. "in supervising ...")
IN = re.compile(r".*\bin\b(?!\b.+ing)")
for doc in nltk.corpus.ieer.parsed_docs("NYT_19980315"):
    for rel in nltk.sem.extract_rels("ORG", "LOC", doc, corpus="ieer", pattern=IN):
        print(nltk.sem.rtuple(rel))
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
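To see what the `IN` pattern accepts, it can be tried on its own against a few candidate strings. This is a sketch under one assumption: that `extract_rels` applies the pattern with `match` to the text between the two entities (the quoted middle part of each line above). The first three strings are taken from that output; the last two are hypothetical counterexamples added for illustration.

```python
import re

# Same pattern as in the cell above.
IN = re.compile(r".*\bin\b(?!\b.+ing)")

fillers = [
    "in",                                        # from the output above
    "firm in",                                   # from the output above
    ", the research group in",                   # from the output above
    "said",                                      # hypothetical: no "in" at all
    "success in supervising the transition of",  # hypothetical: "in" + gerund
]

# Keep only the fillers the pattern accepts.
accepted = [text for text in fillers if IN.match(text)]
for text in accepted:
    print(repr(text))
```

The negative lookahead `(?!\b.+ing)` is what rejects the last string: "in" is followed later by a word ending in "ing", which usually signals a verb phrase rather than a location.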