Basic text operations using NLTK

  • 30 min | Last modified: November 29, 2020

[1]:
import nltk

nltk.download("stopwords")
nltk.download("punkt")
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[1]:
True
[2]:
##
## Data preparation
##
import pandas as pd

data = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/scopus-abstracts.csv",
    sep=",",
    thousands=None,
    decimal=".",
    encoding="utf-8",
)
data.columns
[2]:
Index(['DOI', 'Link', 'Abstract'], dtype='object')
[3]:
##
## Number of records
##
len(data)
[3]:
1902
[4]:
##
## Example of an abstract
##
data.Abstract[0]
[4]:
'Mobility is one of the fundamental requirements of human life with significant societal impacts including productivity, economy, social wellbeing, adaptation to a changing climate, and so on. Although human movements follow specific patterns during normal periods, there are limited studies on how such patterns change due to extreme events. To quantify the impacts of an extreme event to human movements, we introduce the concept of mobility resilience which is defined as the ability of a mobility system to manage shocks and return to a steady state in response to an extreme event. We present a method to detect extreme events from geo-located movement data and to measure mobility resilience and transient loss of resilience due to those events. Applying this method, we measure resilience metrics from geo-located social media data for multiple types of disasters occurred all over the world. Quantifying mobility resilience may help us to assess the higher-order socio-economic impacts of extreme events and guide policies towards developing resilient infrastructures as well as a nation’s overall disaster resilience strategies. © 2019, The Author(s).'
[5]:
##
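## Some abstracts end with the copyright mark + year + 'The Author(s).'
## Everything from the mark onward is removed. str.partition leaves
## abstracts without the mark untouched, whereas slicing with str.find
## would drop their last character (find returns -1 when nothing matches).
##
data["Abstract"] = data.Abstract.map(
    lambda w: w.partition("\u00a9")[0], na_action="ignore"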
)
data.Abstract[0]
[5]:
'Mobility is one of the fundamental requirements of human life with significant societal impacts including productivity, economy, social wellbeing, adaptation to a changing climate, and so on. Although human movements follow specific patterns during normal periods, there are limited studies on how such patterns change due to extreme events. To quantify the impacts of an extreme event to human movements, we introduce the concept of mobility resilience which is defined as the ability of a mobility system to manage shocks and return to a steady state in response to an extreme event. We present a method to detect extreme events from geo-located movement data and to measure mobility resilience and transient loss of resilience due to those events. Applying this method, we measure resilience metrics from geo-located social media data for multiple types of disasters occurred all over the world. Quantifying mobility resilience may help us to assess the higher-order socio-economic impacts of extreme events and guide policies towards developing resilient infrastructures as well as a nation’s overall disaster resilience strategies. '
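A pitfall worth knowing when cutting a string at a marker: `str.find` returns -1 if the marker is absent, and a slice ending at -1 drops the final character. The sketch below compares that behaviour with `str.partition` on two hypothetical sample strings (they are not taken from the dataset); `strip_copyright` is an illustrative helper name.

```python
# Hypothetical sample strings: one with a copyright notice, one without.
with_notice = "Quantifying mobility resilience may help. \u00a9 2019, The Author(s)."
without_notice = "We present a method to detect extreme events."

# str.find returns -1 when the mark is absent, so the slice w[0:-1]
# silently drops the final character.
sliced = without_notice[0 : without_notice.find("\u00a9")]


def strip_copyright(text):
    """Keep everything before the first copyright mark (illustrative helper)."""
    # str.partition returns the whole string as its first element when the
    # separator is absent, so nothing is lost.
    return text.partition("\u00a9")[0]


print(repr(sliced))
print(repr(strip_copyright(without_notice)))
print(repr(strip_copyright(with_notice)))
```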
[6]:
##
## Length of the abstracts in characters
##  Colors available in matplotlib: https://matplotlib.org/3.1.0/gallery/color/named_colors.html
##
import matplotlib.pyplot as plt

data.Abstract.map(len, na_action="ignore").plot.hist(
    color="darkorange", alpha=0.6, rwidth=0.8, edgecolor="k", figsize=(9, 5)
)

plt.gca().spines["left"].set_color("lightgray")
plt.gca().spines["bottom"].set_color("gray")
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
[Figure: histogram of abstract lengths in characters]
[7]:
##
## Length of the abstracts in words
##
data.Abstract.map(lambda w: len(w.split()), na_action="ignore").plot.hist(
    color="darkorange", alpha=0.6, rwidth=0.8, edgecolor="k", figsize=(9, 5)
)

plt.gca().spines["left"].set_color("lightgray")
plt.gca().spines["bottom"].set_color("gray")
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
[Figure: histogram of abstract lengths in words]
[8]:
##
## Search for abstracts containing a particular string
##
data.Abstract[data.Abstract.map(lambda w: "mobility" in w.lower(), na_action="ignore")]
[8]:
0       Mobility is one of the fundamental requirement...
10      The tendency of people to form socially cohesi...
54      In recent years, mobility data from smart card...
95      The influence of urban design on economic vita...
111     This study demonstrates the use of mobile phon...
188     Customer profiles that include gender and age ...
209     To measure job accessibility, person-based app...
235     In this research, we exploit repeated parts in...
236     Tourist flows in historical cities are continu...
239     It is well reported that long commutes have a ...
242     Nowadays, Location-Based Social Networks (LBSN...
244     In the last decades, the notion that cities ar...
251     In Latin America, shopping malls seem to offer...
253     Traditional crime prediction models based on c...
255     Human mobility always had a great influence on...
257     In this paper, we follow the short-ranged Syri...
262     Epidemic outbreaks are an important healthcare...
263     Billions of users of mobile phones, social med...
265     A multi-modal transportation system of a city ...
266     Estimating revenue and business demand of a ne...
275     Predictive models for human mobility have impo...
582     Understanding and modeling the mobility of ind...
587     Pokémon Go, a location-based game that uses au...
597     Next place prediction algorithms are invaluabl...
666     Walking is a form of active transportation wit...
845     China’s economic reforms of 1978, which led to...
865     Big data is among the most promising research ...
870     Predicting human mobility flows at different s...
886     Tourism is becoming a significant contributor ...
891     Whenever someone makes or receives a call on a...
894     The exploration of people’s everyday life has ...
946     Cloud storage services have become ubiquitous....
1056    In recent years, we have seen scientists attem...
1061    One of the greatest concerns related to the po...
1065    Transportation planning is strongly influenced...
1066    Customers mobility is dependent on the sophist...
1102    The wealth of information provided by real-tim...
1134    Geospatial big data refers to spatial data set...
1164    The consumerization of information technology ...
1167    Human mobility in a city represents a fascinat...
1169    There is an increasing trend of people leaving...
1300    The newly released Orange D4D mobile phone dat...
1325    This study leverages mobile phone data to anal...
1580    The mobile cellular systems are expected to su...
1625    In the age of mobile computing where users can...
Name: Abstract, dtype: object
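The same kind of search can also be written with the pandas string accessor. This is a minimal sketch on a hypothetical four-row Series (one row is deliberately missing), not a rerun of the query on the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the Abstract column, including a missing value.
abstracts_sample = pd.Series(
    [
        "Mobility is one of the fundamental requirements.",
        "Big data is among the most promising research trends.",
        None,
        "Predicting human mobility flows at different scales.",
    ]
)

# case=False makes the match case-insensitive; na=False treats missing
# abstracts as non-matches so the mask stays purely boolean;
# regex=False searches for the literal string.
mask = abstracts_sample.str.contains("mobility", case=False, na=False, regex=False)
matches = abstracts_sample[mask]
print(matches.index.tolist())
```

Passing `na=False` matters whenever the column may contain missing values, because a mask holding NaN cannot be used for boolean indexing.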
[9]:
##
## Word tokenizer
##   Splits sentences into words
##
from nltk.tokenize import word_tokenize

tokens = data.Abstract.map(word_tokenize)

# first 20 tokens of the first abstract
tokens[0][:20]
[9]:
['Mobility',
 'is',
 'one',
 'of',
 'the',
 'fundamental',
 'requirements',
 'of',
 'human',
 'life',
 'with',
 'significant',
 'societal',
 'impacts',
 'including',
 'productivity',
 ',',
 'economy',
 ',',
 'social']
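Note that the tokenizer returns punctuation marks as separate tokens. To see why that differs from plain whitespace splitting, here is a small sketch using only the standard library. The regular expression is a rough approximation chosen for illustration; it is not the algorithm that `word_tokenize` implements.

```python
import re

sentence = "Mobility impacts productivity, economy, and so on."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
by_whitespace = sentence.split()

# A simple pattern: runs of word characters, or any single character that
# is neither a word character nor whitespace.
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_whitespace)
print(by_regex)
```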
[10]:
##
## Concordances
##   Shows each occurrence of a word together with its surrounding context
##
abstracts = data.Abstract.copy()
abstracts = abstracts.dropna()
abstracts = abstracts.map(lambda w: w.strip())
abstracts = abstracts.map(lambda w: w + "." if w[-1] != "." else w)
abstracts = abstracts.tolist()
abstracts = " ".join(abstracts)

abstracts = word_tokenize(abstracts)
abstracts = nltk.Text(abstracts)
abstracts.concordance("human")
Displaying 25 of 230 matches:
                                     human life with significant societal impac
nging climate , and so on . Although human movements follow specific patterns d
y the impacts of an extreme event to human movements , we introduce the concept
-scale online aggregators of offline human biases ? Often portrayed as easy-to-
e seek alternative features based on human behavior that might explain part of
ctivity of individual investors . In human relations individuals ’ gender and a
ed caller and callee combinations of human interactions , namely male to male ,
endship give rise to a wide scope of human sociality . Here we analyse the rela
ts represent a new view of worldwide human behavior and a new application of ma
owever , this task relies heavily on human observers in the affected locations
utomating this process , the risk of human error is also eliminated . Compared
data . AI is taking over the job ’ s human do , receptionists , drivers , chefs
he rights and ethics of AI ” ? . The human race is on an inevitable path of AI
proposed method reduces the need for human work and makes it easy to intelligen
produce big data streams can require human operators to monitor these event str
al environmental pollution caused by human activities has become a threat to pu
ata mining were mainly attributed to human bias and shortcomings of the law ; t
data in order to analyze and predict human behavior . Over the last decade , si
re needed to interactively integrate human cognitive sensemaking activity with
computational model that mirrors the human sensemaking process , and consists o
g semantic interaction such that the human 's spatial synthesis actions are tra
at reflecting urban developments and human mobility ) to look at the impact of
configurational variables to explain human spatial behavior and spatial cogniti
 a common feature of many developing human societies . In many cases , past and
be used on a large scale to speed up human development processes in cities thro
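The idea behind a concordance (a keyword-in-context listing) can be sketched in a few lines of plain Python. The `kwic` function below is a hypothetical helper written for this illustration; it counts context in tokens, whereas the NLTK output above is aligned by character width.

```python
def kwic(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens on either side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        # Compare in lowercase so that "Human" and "human" both match.
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "Although human movements follow patterns , Human mobility changes".split()
for line in kwic(tokens, "human", width=2):
    print(line)
```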
[11]:
##
## Words that appear in the same contexts as the given word
##
abstracts.similar("impacts")
impact most effect effects data influence characteristics levels
mobility one and patterns system field results factors dynamics
outcomes accuracy features
[12]:
##
## Contexts shared by all the given words
##
abstracts.common_contexts(["human", "interaction"])
and_networks
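An output such as `and_networks` means that both words were seen between "and" and "networks". The sketch below reproduces that idea in simplified form with the standard library; `contexts` and `common_contexts` are hypothetical helpers, and the token list is invented for the example. NLTK's own implementation may differ in details such as how text boundaries are handled.

```python
from collections import defaultdict


def contexts(tokens):
    """Map each lowercased token to the set of (previous, next) pairs around it."""
    index = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        index[word.lower()].add((prev.lower(), nxt.lower()))
    return index


def common_contexts(tokens, words):
    """Return the contexts shared by every word in `words`, as 'left_right' strings."""
    index = contexts(tokens)
    shared = set.intersection(*(index[w.lower()] for w in words))
    return sorted(f"{left}_{right}" for left, right in shared)


tokens = "social and human networks differ from social and interaction networks".split()
print(common_contexts(tokens, ["human", "interaction"]))
```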
[13]:
##
## Token count (words and punctuation marks)
##
len(abstracts)
[13]:
348451
[14]:
##
## Vocabulary (first 30 distinct tokens in sorted order)
##
sorted(set(abstracts))[:30]
[14]:
['!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "''",
 "'3DStock",
 "'Berlin",
 "'Big",
 "'Communities",
 "'Data",
 "'E-consultant",
 "'Engineering",
 "'European",
 "'HorVertical",
 "'JAMSTEC",
 "'Prime-Example",
 "'Research",
 "'Researcher",
 "'Spintronics",
 "'Tamburi",
 "'Virtual",
 "'age",
 "'analytical",
 "'big",
 "'data",
 "'engine",
 "'four",
 "'fuzzy"]
[15]:
##
## Vocabulary size (number of distinct tokens)
##
len(set(abstracts))
[15]:
19978
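The token count and the vocabulary size are often combined into a single ratio, sometimes called lexical diversity. A minimal sketch follows; `lexical_diversity` is a hypothetical helper name and the sample list is invented.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty sequence)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = ["big", "data", "and", "big", "models"]
print(lexical_diversity(sample))  # 4 distinct tokens out of 5

# The same ratio from the counts reported in this notebook:
# 19978 distinct tokens out of 348451 tokens.
print(round(19978 / 348451, 4))
```

The ratio is case-sensitive here, so "The" and "the" count as different tokens, just as they do in the vocabulary above.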
[16]:
##
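## Occurrences of a word
##   count() is case-sensitive, while concordance() ignores case,
##   which is why the two totals differ (213 vs. 230)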
##
abstracts.count("human")
[16]:
213
[17]:
##
## Indexing tokens
##
abstracts[98]
[17]:
'extreme'
[18]:
abstracts[98:110]
[18]:
['extreme',
 'event',
 '.',
 'We',
 'present',
 'a',
 'method',
 'to',
 'detect',
 'extreme',
 'events',
 'from']
[19]:
abstracts.index("extreme")
[19]:
53
[20]:
##
## Computing word frequencies
##
from nltk import FreqDist

fd = FreqDist(abstracts)
fd
[20]:
FreqDist({'the': 16783, ',': 13912, '.': 13814, 'of': 12884, 'and': 10793, 'to': 7812, 'a': 6364, 'in': 5374, 'data': 4468, 'is': 4107, ...})
[21]:
fd.most_common(20)
[21]:
[('the', 16783),
 (',', 13912),
 ('.', 13814),
 ('of', 12884),
 ('and', 10793),
 ('to', 7812),
 ('a', 6364),
 ('in', 5374),
 ('data', 4468),
 ('is', 4107),
 ('for', 3622),
 ('that', 3008),
 ('The', 2605),
 ('on', 2508),
 ('are', 2320),
 (')', 2245),
 ('(', 2217),
 ('with', 2149),
 ('as', 1828),
 ('this', 1794)]
[22]:
fd.plot(40, cumulative=True)
[Figure: cumulative frequency plot of the 40 most common tokens]
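A cumulative frequency curve is simply the running total of the counts, taken from the most to the least frequent token. The following sketch computes those running totals with `collections.Counter` on an invented token list, as a stand-in for the full frequency distribution.

```python
from collections import Counter
from itertools import accumulate

tokens = "the data of the model and the data".split()
counts = Counter(tokens)

# most_common() sorts by decreasing count; accumulate() gives running totals.
ranked = counts.most_common()
running = list(accumulate(count for _, count in ranked))
total = sum(counts.values())

for (token, count), cum in zip(ranked, running):
    # token, its count, the cumulative count, and the cumulative share
    print(f"{token:6s} {count:2d} {cum:2d} {cum / total:.0%}")
```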
[23]:
##
## Collocations
##   Word pairs that tend to appear together
##
abstracts.collocations()
big data; Big Data; machine learning; social media; time series;
results show; data mining; case study; supply chain; data sets;
decision making; paper presents; mobile phone; United States; paper
proposes; land use; association rules; experimental results; social
networks; recent years
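The starting point for finding collocations is counting adjacent word pairs (bigrams). The sketch below does only that raw counting on an invented token list; ranking the pairs with a statistical association measure and filtering out uninformative words, as a collocation finder typically does, is left out.

```python
from collections import Counter

tokens = (
    "big data and machine learning use big data ; "
    "machine learning needs big data"
).split()

# Pair every token with its successor and count the pairs.
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(2))
```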
[24]:
##
## Strip every non-letter character from each token
## and discard the tokens that end up empty
##
import re

words = [re.sub(r"[^A-Za-z]", "", w) for w in abstracts]
words = [w for w in words if w != ""]
words[:6]
[24]:
['Mobility', 'is', 'one', 'of', 'the', 'fundamental']
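This cleaning step has side effects that are easy to miss: hyphenated words are fused, accented letters are dropped, and purely numeric tokens disappear. A short sketch on a few hand-picked tokens (chosen for illustration) makes them visible.

```python
import re

tokens = ["geo-located", "Pokémon", "2019", "'Big", "(", "s"]

# Delete every character outside A-Z and a-z, then drop the empty results.
cleaned = [re.sub(r"[^A-Za-z]", "", t) for t in tokens]
kept = [t for t in cleaned if t != ""]

print(cleaned)
print(kept)
```

If accented letters should survive, a pattern based on `str.isalpha` or a Unicode-aware character class would be needed instead of `[^A-Za-z]`.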
[25]:
##
## Convert the words to lowercase
##
words = [w.lower() for w in words]
words[:6]
[25]:
['mobility', 'is', 'one', 'of', 'the', 'fundamental']
[26]:
##
## Word count
##   See https://docs.python.org/3/library/collections.html
##
from collections import Counter

counter = Counter(words)
counter.most_common(10)
[26]:
[('the', 19388),
 ('of', 12891),
 ('and', 10808),
 ('to', 8065),
 ('a', 6699),
 ('in', 6627),
 ('data', 4946),
 ('is', 4126),
 ('for', 3758),
 ('that', 3017)]
[27]:
##
## Stopword removal
##    pip3 install nltk
##    nltk.download('stopwords')
##
# a set makes the membership test in the filter below fast
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

words = [w for w in words if w not in STOPWORDS]
counter = Counter(words)
counter.most_common(10)
[27]:
[('data', 4946),
 ('paper', 1041),
 ('model', 951),
 ('using', 920),
 ('information', 907),
 ('research', 831),
 ('results', 827),
 ('analysis', 806),
 ('based', 732),
 ('used', 730)]
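The effect of stopword filtering is easy to verify on a toy example. The sketch below uses a deliberately tiny, hypothetical stopword set rather than NLTK's English list, so it runs without downloading any corpus.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
stopwords = {"the", "of", "and", "to", "a", "in", "is"}

words = "the analysis of the data is a key step in the analysis".split()

# Membership tests on a set are constant-time on average,
# whereas a list has to be scanned element by element.
content_words = [w for w in words if w not in stopwords]
print(Counter(content_words).most_common(2))
```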

Exercise

For the following text:

  1. Compute how many unique words the text contains.

  2. Compute how many sentences the text contains.

  3. What are the ten most frequent words?

  4. How many words end in "ing"?

[28]:
texto = """
Analytics is the discovery, interpretation, and communication of meaningful patterns
in data. Especially valuable in areas rich with recorded information, analytics relies
on the simultaneous application of statistics, computer programming and operations research
to quantify performance.

Organizations may apply analytics to business data to describe, predict, and improve business
performance. Specifically, areas within analytics include predictive analytics, prescriptive
analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big
Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,
marketing optimization and marketing mix modeling, web analytics, call analytics, speech
analytics, sales force sizing and optimization, price and promotion modeling, predictive
science, credit risk analysis, and fraud analytics. Since analytics can require extensive
computation (see big data), the algorithms and software used for analytics harness the most
current methods in computer science, statistics, and mathematics.

The field of data analysis. Analytics often involves studying past historical data to
research potential trends, to analyze the effects of certain decisions or events, or to
evaluate the performance of a given tool or scenario. The goal of analytics is to improve
the business by gaining knowledge which can be used to make improvements or changes.

Data analytics (DA) is the process of examining data sets in order to draw conclusions
about the information they contain, increasingly with the aid of specialized systems
and software. Data analytics technologies and techniques are widely used in commercial
industries to enable organizations to make more-informed business decisions and by
scientists and researchers to verify or disprove scientific models, theories and
hypotheses.
"""