Regular expressions in NLTK
90 min | Last modified: December 3, 2020
[17]:
import re
import nltk
nltk.download('treebank')
wsj = sorted(set(nltk.corpus.treebank.words()))
# Decimal numbers: digits, a literal dot, digits
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)][:10]
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data] Package treebank is already up-to-date!
[17]:
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5']
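The escaped dot in the pattern above is what limits the matches to decimal numbers. A minimal sketch on a hand-made token list (the tokens are illustrative, not taken from the corpus) shows the difference an unescaped dot would make:

```python
import re

tokens = ['0.05', '1995', '3-4', '12']

# Escaped dot: only a literal "." is accepted between the two digit runs.
strict = [t for t in tokens if re.search(r'^[0-9]+\.[0-9]+$', t)]

# Unescaped dot: any single character is accepted in that position.
loose = [t for t in tokens if re.search(r'^[0-9]+.[0-9]+$', t)]

print(strict)  # ['0.05']
print(loose)   # ['0.05', '1995', '3-4']
```

Writing the pattern as a raw string (`r'...'`) keeps Python from interpreting `\.` as a string escape before `re` sees it.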
[18]:
# Uppercase letters followed by a literal dollar sign (currency codes)
[w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
[18]:
['C$', 'US$']
[19]:
[w for w in wsj if re.search('^[0-9]{4}$', w)][:10]
[19]:
['1614',
'1637',
'1787',
'1901',
'1903',
'1917',
'1925',
'1929',
'1933',
'1934']
[20]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)][:10]
[20]:
['10-day',
'10-lap',
'10-year',
'100-share',
'12-point',
'12-year',
'14-hour',
'15-day',
'150-point',
'190-point']
[21]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
[21]:
['black-and-white',
'bread-and-butter',
'father-in-law',
'machine-gun-toting',
'savings-and-loan']
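The three quantifier forms used above behave differently: `{5,}` means five or more, `{2,3}` means two to three, and `{,6}` means zero to six. A small sketch on made-up candidates (not corpus words) makes the lower bound of zero visible:

```python
import re

pattern = r'^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$'
candidates = [
    'father-in-law',     # 6 letters, 2 letters, 3 letters
    'up-to-date',        # first part is too short for {5,}
    'state-of-the-art',  # four parts, so the final $ cannot match
    'black-and-',        # empty last part is allowed by {,6}
]

matches = [c for c in candidates if re.search(pattern, c)]
print(matches)  # ['father-in-law', 'black-and-']
```

If an empty last component is not wanted, `{1,6}` would be the stricter choice.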
[22]:
[w for w in wsj if re.search('(ed|ing)$', w)][:10]
[22]:
['62%-owned',
'Absorbed',
'According',
'Adopting',
'Advanced',
'Advancing',
'Alfred',
'Allied',
'Annualized',
'Anything']
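The output above includes 'Alfred', which shows that `(ed|ing)$` only tests the last letters and knows nothing about word structure. The sketch below, on an illustrative word list, also shows how the parentheses change what `re.findall` returns, and how a non-capturing group `(?:...)` avoids that:

```python
import re

words = ['Advancing', 'Alfred', 'owned', 'red', 'market']

# re.search only answers "does the word end in ed or ing?"
hits = [w for w in words if re.search(r'(ed|ing)$', w)]
print(hits)  # ['Advancing', 'Alfred', 'owned', 'red']

# With a capturing group, findall returns the group, not the whole match.
suffix_only = re.findall(r'^.*(ed|ing)$', 'Advancing')
print(suffix_only)  # ['ing']

# A non-capturing group keeps the alternation but returns the full match.
whole_word = re.findall(r'^.*(?:ed|ing)$', 'Advancing')
print(whole_word)  # ['Advancing']
```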
[23]:
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)
[23]:
['u',
'e',
'a',
'i',
'a',
'i',
'i',
'i',
'e',
'i',
'a',
'i',
'o',
'i',
'o',
'u']
[24]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)
[24]:
[('io', 549),
('ea', 476),
('ie', 331),
('ou', 329),
('ai', 261),
('ia', 253),
('ee', 217),
('oo', 174),
('ua', 109),
('au', 106),
('ue', 105),
('ui', 95)]
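The same counting idea can be tried without downloading a corpus. The sketch below uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which is a `Counter` subclass) and a hand-made word list, so the counts are illustrative only:

```python
import re
from collections import Counter

words = ['audio', 'radio', 'beautiful', 'trees']

# Count every run of two or more consecutive vowels across all words.
vowel_runs = Counter(
    run for word in words for run in re.findall(r'[aeiou]{2,}', word)
)

print(vowel_runs.most_common(1))  # [('io', 2)]
print(vowel_runs['eau'])          # 1
```

Because `{2,}` is greedy, a run such as 'eau' is counted once as a three-letter sequence rather than as 'ea' plus 'au'.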
[25]:
nltk.download('udhr')
# Keep word-initial and word-final vowel runs, plus every non-vowel character
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)
english_udhr = nltk.corpus.udhr.words('English-Latin1')
nltk.tokenwrap(compress(w) for w in english_udhr[:75])
[nltk_data] Downloading package udhr to /root/nltk_data...
[nltk_data] Package udhr is already up-to-date!
[25]:
'Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and\nof the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn\nof frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn\nrghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,\nand the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and'
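To see why 'Universal' becomes 'Unvrsl' while 'the' is left intact, it helps to look at the pieces `re.findall` returns before they are joined. This sketch reuses the same pattern on three words from the output above:

```python
import re

regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

# Map each word to the list of pieces the pattern extracts from it.
pieces = {w: re.findall(regexp, w) for w in ['Universal', 'equal', 'the']}

for word, parts in pieces.items():
    print(word, parts, ''.join(parts))
# Universal ['U', 'n', 'v', 'r', 's', 'l'] Unvrsl
# equal ['e', 'q', 'l'] eql
# the ['t', 'h', 'e'] the
```

Vowels in the middle of a word match none of the three alternatives, so `findall` skips over them.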
[45]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
##
## Captures the word in `a XXX man`
##
moby.findall(r"<a> (<.*>) <man>")
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
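In `Text.findall` patterns, each `<...>` stands for one whole token, and parentheses select which tokens are reported. The sketch below imitates that behaviour with plain `re`. The helper `token_findall` is a hypothetical, simplified stand-in written for illustration, not NLTK's own implementation, and the token list is made up:

```python
import re

def token_findall(tokens, pattern):
    """Simplified, hypothetical imitation of token-level regex search."""
    # Represent the token sequence as one string: <tok1><tok2>...
    raw = ''.join('<' + t + '>' for t in tokens)
    # Whitespace in the pattern is only there for readability.
    pattern = re.sub(r'\s', '', pattern)
    # Wrap each <...> in a non-capturing group so quantifiers apply to tokens.
    pattern = re.sub(r'<', '(?:<(?:', pattern)
    pattern = re.sub(r'>', ')>)', pattern)
    # Stop "." from matching across a token boundary.
    pattern = re.sub(r'(?<!\\)\.', '[^>]', pattern)
    hits = re.findall(pattern, raw)
    # Strip the outer angle brackets and split each hit back into tokens.
    return [h[1:-1].split('><') for h in hits]

tokens = 'he was a good man and a very wise man'.split()
print(token_findall(tokens, r'<a> (<.*>) <man>'))  # [['good']]

chat_tokens = ['lol', 'lol', 'lol', 'ok']
print(token_findall(chat_tokens, r'<l.*>{3,}'))    # [['lol', 'lol', 'lol']]
```

Only 'good' is reported in the first call because `<.*>` matches exactly one token, so 'a very wise man' does not fit the pattern.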
[46]:
nltk.download('nps_chat')
from nltk.corpus import nps_chat
chat = nltk.Text(nps_chat.words())
chat.findall(r"<.*> <.*> <bro>")
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data] Package nps_chat is already up-to-date!
you rule bro; telling you bro; u twizted bro
[47]:
chat.findall(r"<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
[48]:
nltk.download('brown')
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data] Package brown is already up-to-date!
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
Baby names exercise from Google Education
Download the file https://developers.google.com/edu/python/google-python-exercises.zip and unzip it.
The files baby1990.html, baby1991.html contain the HTML code of the web page that publishes the most popular names for babies born in the corresponding year.
Write a function that returns a flat list containing the year, followed by each name and its rank. The list must present the names in alphabetical order and must include boys' and girls' names together.
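One possible approach is sketched below under several assumptions: the year appears in a heading of the form `Popularity in 1990`, each table row holds the rank, a boy's name, and a girl's name in consecutive `<td>` cells, and each list entry is formatted as `'Name rank'`. The function name `extract_names` and the inline `sample` string are illustrative. The real files should be checked against these assumptions before relying on the patterns.

```python
import re

def extract_names(html):
    """Return [year, 'Name rank', ...] with names in alphabetical order."""
    # Assumed heading format: "Popularity in YYYY".
    year_match = re.search(r'Popularity in (\d{4})', html)
    if year_match is None:
        raise ValueError('year not found')

    # Assumed row format: <td>rank</td><td>boy</td><td>girl</td>.
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', html)

    ranks = {}
    for rank, boy, girl in rows:
        for name in (boy, girl):
            # If a name appears in both columns, keep its best (lowest) rank.
            if name not in ranks or int(rank) < int(ranks[name]):
                ranks[name] = rank

    return [year_match.group(1)] + [f'{n} {ranks[n]}' for n in sorted(ranks)]

# Hand-written sample that imitates the assumed layout.
sample = '''
<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
'''

result = extract_names(sample)
print(result)
# ['1990', 'Ashley 2', 'Christopher 2', 'Jessica 1', 'Michael 1']
```

To run it on a downloaded file, read the file contents into a string first and pass that string to the function.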