{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformaciones básicas de texto\n", "\n", "* *30 min* | Última modificación: Diciembre 3, 2020" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['DOI', 'Link', 'Abstract'], dtype='object')" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Preparacion de los datos\n", "##\n", "import pandas as pd\n", "\n", "data = pd.read_csv(\n", " \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/scopus-abstracts.csv\",\n", " sep=\",\",\n", " thousands=None,\n", " decimal=\".\",\n", " encoding=\"utf-8\",\n", ")\n", "data.columns" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Mobility is one of the fundamental requirements of human life with significant societal impacts including productivity, economy, social wellbeing, adaptation to a changing climate, and so on. Although human movements follow specific patterns during normal periods, there are limited studies on how such patterns change due to extreme events. To quantify the impacts of an extreme event to human movements, we introduce the concept of mobility resilience which is defined as the ability of a mobility system to manage shocks and return to a steady state in response to an extreme event. We present a method to detect extreme events from geo-located movement data and to measure mobility resilience and transient loss of resilience due to those events. Applying this method, we measure resilience metrics from geo-located social media data for multiple types of disasters occurred all over the world. Quantifying mobility resilience may help us to assess the higher-order socio-economic impacts of extreme events and guide policies towards developing resilient infrastructures as well as a nation’s overall disaster resilience strategies. 
'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Abstracts\n", "##\n", "abstracts = data.Abstract.copy()\n", "abstracts = abstracts.map(lambda w: w[0 : w.find(\"\\u00a9\")], na_action=\"ignore\")\n", "abstracts[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenizers" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "##\n", "## Numero de sentencias por abstract\n", "##\n", "import matplotlib.pyplot as plt\n", "\n", "abstracts.map(lambda w: len(w), na_action=\"ignore\").plot.hist(\n", " color=\"darkorange\", alpha=0.6, rwidth=0.8, edgecolor=\"k\"\n", ")\n", "\n", "plt.gca().spines[\"left\"].set_color(\"lightgray\")\n", "plt.gca().spines[\"bottom\"].set_color(\"gray\")\n", "plt.gca().spines[\"top\"].set_visible(False)\n", "plt.gca().spines[\"right\"].set_visible(False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['El proceso de leudado de los productos horneados es fundamental para desarrollar sus propiedades de\\ncalidad.',\n", " 'El objetivo de este estudio fue evaluar el efecto de diferentes tipos de polvos para hornear en las \\npropiedades de calidad de muffins.',\n", " 'Se evaluaron las propiedades físico-químicas tanto del batido como del producto \\nfinal.',\n", " 'Además de su influencia en las propiedades farinológicas de la harina y las propiedades texturales y \\nsensoriales del producto en el almacenamiento.',\n", " 'Se encontró la formulación PH16 como la más adecuada, siendo la \\nde mayor altura (47.66 ± 0.35 mm), menor contenido de humedad (24.31 ± 0.18 %), menor dureza (12.34 ± 0.34 N) y \\nfirmeza de miga más baja (1.84 ± 0.01).',\n", " 'El comportamiento de la muestra PH16 en almacenamiento y a nivel sensorial \\nno tuvo diferencias significativas con la muestra control seleccionada.']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Sentence tokenizer --- Español\n", "##\n", "text_es = \"\"\"El proceso de leudado de los productos horneados es fundamental para desarrollar sus propiedades de\n", "calidad. El objetivo de este estudio fue evaluar el efecto de diferentes tipos de polvos para hornear en las \n", "propiedades de calidad de muffins. Se evaluaron las propiedades físico-químicas tanto del batido como del producto \n", "final. Además de su influencia en las propiedades farinológicas de la harina y las propiedades texturales y \n", "sensoriales del producto en el almacenamiento. Se encontró la formulación PH16 como la más adecuada, siendo la \n", "de mayor altura (47.66 ± 0.35 mm), menor contenido de humedad (24.31 ± 0.18 %), menor dureza (12.34 ± 0.34 N) y \n", "firmeza de miga más baja (1.84 ± 0.01). 
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['El proceso de leudado de los productos horneados es fundamental para desarrollar sus propiedades de\\ncalidad.',\n", " 'El objetivo de este estudio fue evaluar el efecto de diferentes tipos de polvos para hornear en las \\npropiedades de calidad de muffins.',\n", " 'Se evaluaron las propiedades físico-químicas tanto del batido como del producto \\nfinal.',\n", " 'Además de su influencia en las propiedades farinológicas de la harina y las propiedades texturales y \\nsensoriales del producto en el almacenamiento.',\n", " 'Se encontró la formulación PH16 como la más adecuada, siendo la \\nde mayor altura (47.66 ± 0.35 mm), menor contenido de humedad (24.31 ± 0.18 %), menor dureza (12.34 ± 0.34 N) y \\nfirmeza de miga más baja (1.84 ± 0.01).',\n", " 'El comportamiento de la muestra PH16 en almacenamiento y a nivel sensorial \\nno tuvo diferencias significativas con la muestra control seleccionada.']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Sentence tokenizer --- Spanish\n", "##\n", "import nltk\n", "\n", "text_es = \"\"\"El proceso de leudado de los productos horneados es fundamental para desarrollar sus propiedades de\n", "calidad. El objetivo de este estudio fue evaluar el efecto de diferentes tipos de polvos para hornear en las \n", "propiedades de calidad de muffins. Se evaluaron las propiedades físico-químicas tanto del batido como del producto \n", "final. Además de su influencia en las propiedades farinológicas de la harina y las propiedades texturales y \n", "sensoriales del producto en el almacenamiento. Se encontró la formulación PH16 como la más adecuada, siendo la \n", "de mayor altura (47.66 ± 0.35 mm), menor contenido de humedad (24.31 ± 0.18 %), menor dureza (12.34 ± 0.34 N) y \n", "firmeza de miga más baja (1.84 ± 0.01). El comportamiento de la muestra PH16 en almacenamiento y a nivel sensorial \n", "no tuvo diferencias significativas con la muestra control seleccionada.\"\"\"\n", "\n", "nltk.sent_tokenize(text=text_es, language=\"spanish\")" ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mobility',\n", " 'is',\n", " 'one',\n", " 'of',\n", " 'the',\n", " 'fundamental',\n", " 'requirements',\n", " 'of',\n", " 'human',\n", " 'life',\n", " 'with',\n", " 'significant',\n", " 'societal',\n", " 'impacts',\n", " 'including',\n", " 'productivity',\n", " ',',\n", " 'economy',\n", " ',',\n", " 'social',\n", " 'wellbeing',\n", " ',',\n", " 'adaptation',\n", " 'to',\n", " 'a',\n", " 'changing',\n", " 'climate',\n", " ',',\n", " 'and',\n", " 'so',\n", " 'on',\n", " '.',\n", " 'Although',\n", " 'human',\n", " 'movements',\n", " 'follow',\n", " 'specific',\n", " 'patterns',\n", " 'during',\n", " 'normal']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Default word tokenization\n", "##\n", "\n", "#\n", "# The abstracts are extracted again\n", "#\n", "abstracts = data.Abstract.copy()\n", "abstracts = abstracts.map(lambda w: w[0 : w.find(\"\\u00a9\")], na_action=\"ignore\")\n", "\n", "#\n", "# Default word tokenizer\n", "# It is an instance of the Treebank word tokenizer\n", "#\n", "default_word_tokenize = nltk.word_tokenize\n", "abstracts.map(default_word_tokenize, na_action=\"ignore\")[0][0:40]" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Mobility',\n", " 'is',\n", " 'one',\n", " 'of',\n", " 'the',\n", " 'fundamental',\n", " 'requirements',\n", " 'of',\n", " 'human',\n", " 'life',\n", " 'with',\n", " 'significant',\n", " 'societal',\n", " 'impacts',\n", " 'including',\n", " 'productivity',\n", " ',',\n", " 'economy',\n", " ',',\n", " 'social',\n", " 'wellbeing',\n", " ',',\n", " 'adaptation',\n", " 'to',\n", " 'a',\n", " 'changing',\n", " 'climate',\n", " ',',\n", " 'and',\n", " 'so',\n", " 'on.',\n", " 'Although',\n", " 'human',\n", " 'movements',\n", " 'follow',\n", " 'specific',\n", " 'patterns',\n", " 'during',\n", " 'normal',\n", " 'periods']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## TokTok tokenizer\n", "##\n", "from nltk.tokenize.toktok import ToktokTokenizer\n", "\n", "# note that it does not split off the sentence-ending periods inside the paragraph\n", "toktok_word_tokenizer = nltk.ToktokTokenizer()\n", "abstracts.map(toktok_word_tokenizer.tokenize, na_action=\"ignore\")[0][0:40]" ] },
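{ "cell_type": "markdown", "metadata": {}, "source": [ "Another readily available option, shown here as a small sketch, is `nltk.wordpunct_tokenize`, which splits punctuation into separate tokens; the same tokenizer is used again in the HTML-cleaning example below." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##\n", "## WordPunct tokenizer (a minimal sketch)\n", "##\n", "import nltk\n", "\n", "abstracts.map(nltk.wordpunct_tokenize, na_action=\"ignore\")[0][0:20]" ] },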
RegexpTokenizer\n", "\n", "TOKEN_PATTERN = r\"\\w+\"\n", "\n", "regex_tokenizer = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)\n", "abstracts.map(regex_tokenizer.tokenize, na_action=\"ignore\")[0][0:40]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0, 8),\n", " (9, 11),\n", " (12, 15),\n", " (16, 18),\n", " (19, 22),\n", " (23, 34),\n", " (35, 47),\n", " (48, 50),\n", " (51, 56),\n", " (57, 61),\n", " (62, 66),\n", " (67, 78),\n", " (79, 87),\n", " (88, 95),\n", " (96, 105),\n", " (106, 118),\n", " (120, 127),\n", " (129, 135),\n", " (136, 145),\n", " (147, 157),\n", " (158, 160),\n", " (161, 162),\n", " (163, 171),\n", " (172, 179),\n", " (181, 184),\n", " (185, 187),\n", " (188, 190),\n", " (192, 200),\n", " (201, 206),\n", " (207, 216),\n", " (217, 223),\n", " (224, 232),\n", " (233, 241),\n", " (242, 248),\n", " (249, 255),\n", " (256, 263),\n", " (265, 270),\n", " (271, 274),\n", " (275, 282),\n", " (283, 290)]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Posiciones de los tokens en el texto\n", "##\n", "abstracts.map(lambda w: list(regex_tokenizer.span_tokenize(w)), na_action=\"ignore\")[0][\n", " 0:40\n", "]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['Mobility',\n", " 'is',\n", " 'one',\n", " 'of',\n", " 'the',\n", " 'fundamental',\n", " 'requirements',\n", " 'of',\n", " 'human',\n", " 'life',\n", " 'with',\n", " 'significant',\n", " 'societal',\n", " 'impacts',\n", " 'including',\n", " 'productivity',\n", " ',',\n", " 'economy',\n", " ',',\n", " 'social',\n", " 'wellbeing',\n", " ',',\n", " 'adaptation',\n", " 'to',\n", " 'a',\n", " 'changing',\n", " 'climate',\n", " ',',\n", " 'and',\n", " 'so',\n", " 'on',\n", " '.'],\n", " ['Although',\n", " 'human',\n", " 'movements',\n", " 'follow',\n", " 'specific',\n", " 'patterns',\n", " 'during',\n", " 'normal',\n", " 'periods',\n", " ',',\n", " 'there',\n", " 'are',\n", " 'limited',\n", " 'studies',\n", " 'on',\n", " 'how',\n", " 'such',\n", " 'patterns',\n", " 'change',\n", " 'due',\n", " 'to',\n", " 'extreme',\n", " 'events',\n", " '.']]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Tokenizadores robustos\n", "##\n", "def tokenize_text(text):\n", " sentences = nltk.sent_tokenize(text)\n", " word_tokens = [nltk.word_tokenize(sentence) for sentence in sentences]\n", " return word_tokens\n", "\n", "\n", "words = abstracts.map(tokenize_text, na_action=\"ignore\")\n", "\n", "#\n", "# Dos primeras lineas del primer abstract\n", "#\n", "words[0][0:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Remoción de acentos y caracteres especiales" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'AEIOUNaeiouaiou'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Remocion de acentos\n", "##\n", "import unicodedata\n", "\n", "\n", "def remove_accented_chars(text):\n", " text = (\n", " unicodedata.normalize(\"NFKD\", text)\n", " .encode(\"ascii\", \"ignore\")\n", " .decode(\"utf-8\", \"ignore\")\n", " )\n", " return text\n", "\n", "\n", "remove_accented_chars(\"ÁÉÍÓÚÑáéíóúäïöü\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Well this was fun What do you think '" ] }, "execution_count": 15, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Remoción de caracterires especiales\n", "##\n", "import re\n", "\n", "\n", "def remove_special_characters(text, remove_digits=False):\n", " pattern = r\"[^a-zA-z0-9\\s]\" if not remove_digits else r\"[^a-zA-z\\s]\"\n", " text = re.sub(pattern, \"\", text)\n", " return text\n", "\n", "\n", "remove_special_characters(\n", " \"Well this was fun! What do you think? 123#@!\", remove_digits=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Corrección de texto" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'finally'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Correccion de texto --- usando textblob\n", "## (Otras librerias: PyEnchant, aspell-python)\n", "##\n", "#  !pip3 install --user textblob\n", "\n", "from textblob import Word\n", "\n", "w = Word(\"fianlly\")\n", "w.correct()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('finally', 1.0)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check suggestions\n", "w.spellcheck()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('flat', 0.85), ('float', 0.15)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w = Word(\"flaot\")\n", "w.spellcheck()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stemming" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('jump', 'jump', 'jump', 'lie', 'strang')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Stemming\n", "##\n", "from nltk.stem import PorterStemmer\n", "\n", "ps = PorterStemmer()\n", "\n", "ps.stem(\"jumping\"), ps.stem(\"jumps\"), ps.stem(\"jumped\"), ps.stem(\"lying\"), ps.stem(\n", " \"strange\"\n", ")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('jump', 'jump', 'jump', 'lying', 'strange')" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import LancasterStemmer\n", "\n", "ls = LancasterStemmer()\n", "\n", "ls.stem(\"jumping\"), ls.stem(\"jumps\"), ls.stem(\"jumped\"), ls.stem(\"lying\"), ls.stem(\n", " \"strange\"\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('jump', 'jump', 'jump', 'ly', 'strange')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import RegexpStemmer\n", "\n", "rs = RegexpStemmer(\"ing$|s$|ed$\", min=4)\n", "rs.stem(\"jumping\"), rs.stem(\"jumps\"), rs.stem(\"jumped\"), rs.stem(\"lying\"), rs.stem(\n", " \"strange\"\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('jump', 'jump', 'jump', 'lie', 'strang')" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nltk.stem import SnowballStemmer\n", "\n", "ss = SnowballStemmer(\"english\")\n", "ss.stem(\"jumping\"), ss.stem(\"jumps\"), ss.stem(\"jumped\"), ss.stem(\"lying\"), ss.stem(\n", " \"strange\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lematizacion" ] }, { 
"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n", "car\n", "men\n", "running\n", "ate\n", "saddest\n", "fancier\n", "----\n", "car\n", "men\n", "run\n", "eat\n", "sad\n", "fancy\n" ] } ], "source": [ "##\n", "## Lemmatization\n", "##\n", "nltk.download(\"wordnet\")\n", "from nltk.stem import WordNetLemmatizer\n", "\n", "wnl = WordNetLemmatizer()\n", "\n", "print(wnl.lemmatize(\"cars\"))\n", "print(wnl.lemmatize(\"men\"))\n", "print(wnl.lemmatize(\"running\"))\n", "print(wnl.lemmatize(\"ate\"))\n", "print(wnl.lemmatize(\"saddest\"))\n", "print(wnl.lemmatize(\"fancier\"))\n", "print(\"----\")\n", "print(wnl.lemmatize(\"cars\", \"n\")) #  n --> nouns\n", "print(wnl.lemmatize(\"men\", \"n\"))\n", "print(wnl.lemmatize(\"running\", \"v\")) # v --> verbs\n", "print(wnl.lemmatize(\"ate\", \"v\"))\n", "print(wnl.lemmatize(\"saddest\", \"a\")) #  --> adjectves\n", "print(wnl.lemmatize(\"fancier\", \"a\"))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "##\n", "## spaCy hace la lematizaction basado en speech tagging\n", "## !pip3 install spacy\n", "## !python3 -m spacy download en_core_web_sm\n", "##\n", "# !python3 -m spacy download en_core_web_sm\n", "\n", "# import spacy\n", "\n", "# nlp = spacy.load(\"en_core_web_sm\")\n", "# nlp = en_core_web_sm.load()\n", "# text = \"My system keeps crashing his crashed yesterday, ours crashes daily\"\n", "\n", "\n", "# def lemmatize_text(text):\n", "#  text = nlp(text)\n", "#  text = \" \".join(\n", "#  [word.lemma_ if word.lemma_ != \"-PRON-\" else word.text for word in text]\n", "# )\n", "#  return text\n", "\n", "\n", "# lemmatize_text(\"My system keeps crashing! his crashed yesterday, ours crashes daily\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Esquema basico" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['\"',\n", " \"'\",\n", " ',',\n", " ',\"',\n", " '-',\n", " '-------------',\n", " '----------------------------------------------------------------------------------',\n", " '.',\n", " '.\"',\n", " '/',\n", " '01',\n", " '02',\n", " '09',\n", " '11',\n", " '12',\n", " '17',\n", " '200',\n", " '2002',\n", " '2202',\n", " '27']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "##\n", "## Clean html\n", "##\n", "from bs4 import BeautifulSoup\n", "\n", "## captura\n", "url = \"http://news.bbc.co.uk/2/hi/health/2284783.stm\"\n", "html = request.urlopen(url).read().decode(\"utf8\")\n", "\n", "##  se remueven las etiquetas\n", "raw = BeautifulSoup(html, \"html.parser\").get_text()\n", "\n", "## tokenizer\n", "tokens = nltk.wordpunct_tokenize(raw)\n", "\n", "## tokens --> text\n", "text = nltk.Text(tokens)\n", "\n", "## normalizacion\n", "## remocion de puntuacion, acentos, numeros, puntuacion, ....\n", "words = [w.lower() for w in text]\n", "\n", "# vocabulario\n", "sorted(set(words))[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text wrapping" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . 
(1), " ] } ], "source": [ "saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']\n", "for word in saying:\n", " print(word, '(' + str(len(word)) + '),', end=' ')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After 5 all 3 is 2 said 4 and 3 done 4 , 1 more 4 is 2 said 4 than 4\n", "done 4 . 1\n" ] } ], "source": [ "from textwrap import fill\n", "\n", "pieces = [\"{} {}\".format(word, len(word)) for word in saying]\n", "output = ' '.join(pieces)\n", "print(fill(output))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }