Operaciones básicas sobre archivos de datos usando Python — 20:33 min
20:33 min | Última modificación: Octubre 12, 2021
En este video se presenta un ejemplo práctico de la manipulación de archivos de datos en texto usando listas. Si bien, resulta mucho más facil realizar esta tarea usando librerías especializaxad en Python, el objetivo aquí es ejemplificar el uso de las estructuras de datos disponibles en Python puro.
Al finalizar este tutorial, usted estará en capacidad de:
Descargar archivos de datos de internet y convertirlos a una lista de campos.
Visualizar los datos como una tabla.
Reorganizar las columnas de la tabla.
Obtener un subconjunto de registros.
Filtrar los registros.
Buscar registros.
Obtener los valores únicos por campo.
Obtener un subcojunto de las columnas.
Escribir los resultados como una tabla a un archivo.
Carga de datos de los eventos de los conductores
[1]:
#
# Se descarga el archivo directamente del repo al disco duro.
#
url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv"
!wget --quiet {url} -P /tmp/
#
# Visualiza el contenido del directorio
#
!ls -1 /tmp/*
/tmp/data.csv
/tmp/demo.txt
/tmp/truck_event_text_partition.csv
/tmp/tweets.csv
/tmp/tweets.csv.1
/tmp/tweets.csv.2
/tmp/tweets.csv.3
/tmp/tweets.csv.4
[2]:
with open("/tmp/truck_event_text_partition.csv", "r") as file:
truck_events = file.readlines()
#
# Cantidad de filas del archivo incluyendo la cabecera.
#
len(truck_events)
[2]:
17076
[3]:
#
# Visualización del contenido
#
truck_events[0:2]
[3]:
['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate\n',
'14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22\n']
[4]:
#
# Limpieza
#
truck_events = [line.replace("\n", "") for line in truck_events]
[5]:
#
# Visualización del contenido
#
truck_events[0:2]
[5]:
['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate',
'14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22']
[6]:
#
# Conversión de los strings a listas
#
truck_events = [line.split(",") for line in truck_events]
[7]:
#
# Nombres de las columnas
#
truck_events[0]
[7]:
['driverId',
'truckId',
'eventTime',
'eventType',
'longitude',
'latitude',
'eventKey',
'CorrelationId',
'driverName',
'routeId',
'routeName',
'eventDate']
Visualización de datos
[8]:
#
# La visualización de datos como una lista de listas
# resulta inapropiada
#
truck_events[:2]
[8]:
[['driverId',
'truckId',
'eventTime',
'eventType',
'longitude',
'latitude',
'eventKey',
'CorrelationId',
'driverName',
'routeId',
'routeName',
'eventDate'],
['14',
'25',
'59:21.4',
'Normal',
'-94.58',
'37.03',
'14|25|9223370572464814373',
'3.66E+18',
'Adis Cesir',
'160405074',
'Joplin to Kansas City Route 2',
'2016-05-27-22']]
[9]:
def pprint(data):
#
def get_format_string(data):
format_string = []
for i_col in range(len(data[0])):
lengths = [len(str(row[i_col])) for row in data]
max_length = max(lengths)
#
# Por ejemplo: "{:>10s}"
#
format_string.append("{:>" + str(max_length) + "s}")
return format_string
#
def print_data(format_string, data):
format_string = get_format_string(data)
for index, row in enumerate(data):
text = " " if index == 0 else "{:2d} ".format(index - 1)
for fmt, value in zip(format_string, row):
text += fmt.format(str(value)) + " "
if len(text) >= 106:
text = text[:100] + " [...]"
print(text)
#
format_string = get_format_string(data)
print_data(format_string, data)
[10]:
pprint(truck_events[:5])
driverId truckId eventTime eventType longitude latitude eventKey CorrelationId [...]
0 14 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 3.66E+18 [...]
1 18 16 59:21.7 Normal -89.66 39.78 18|16|9223370572464814089 3.66E+18 [...]
2 27 105 59:21.7 Normal -90.21 38.65 27|105|9223370572464814070 3.66E+18 [...]
3 11 74 59:21.7 Normal -90.2 38.65 11|74|9223370572464814123 3.66E+18 [...]
Reorganización de las columnas
[11]:
truck_events = [
[row[0], row[8]] + [row[index] for index in range(1, 12) if index != 8]
for row in truck_events
]
pprint(truck_events[:10])
driverId driverName truckId eventTime eventType longitude latitude eventK [...]
0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|92233705724648143 [...]
1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|92233705724648140 [...]
2 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]
3 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|92233705724648141 [...]
4 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|92233705724648141 [...]
5 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|92233705724648134 [...]
6 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|92233705724648134 [...]
7 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|92233705724648133 [...]
8 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|92233705724648133 [...]
Obtención de un subconjunto de registros
[12]:
truck_events_subset = truck_events[0:10]
pprint(truck_events_subset)
driverId driverName truckId eventTime eventType longitude latitude eventK [...]
0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|92233705724648143 [...]
1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|92233705724648140 [...]
2 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]
3 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|92233705724648141 [...]
4 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|92233705724648141 [...]
5 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|92233705724648134 [...]
6 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|92233705724648134 [...]
7 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|92233705724648133 [...]
8 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|92233705724648133 [...]
Filtrado de registros
[13]:
truck_events_driverId_14 = [truck_events[0]] + [
row for row in truck_events if row[0] == "14"
]
pprint(truck_events_driverId_14[:10])
driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]
0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 [...]
1 14 Adis Cesir 25 59:23.3 Normal -94.31 37.31 14|25|9223370572464812526 [...]
2 14 Adis Cesir 25 59:24.2 Normal -94.3 37.66 14|25|9223370572464811655 [...]
3 14 Adis Cesir 25 59:34.0 Normal -94.3 37.66 14|25|9223370572464801796 [...]
4 14 Adis Cesir 25 59:35.8 Normal -94.46 37.16 14|25|9223370572464800006 [...]
5 14 Adis Cesir 25 59:53.3 Normal -94.58 37.03 14|25|9223370572464782555 [...]
6 14 Adis Cesir 25 59:54.0 Normal -94.46 37.16 14|25|9223370572464781805 [...]
7 14 Adis Cesir 25 59:57.5 Normal -94.35 38.33 14|25|9223370572464778335 [...]
8 14 Adis Cesir 25 59:59.2 Normal -94.38 38.99 14|25|9223370572464776636 [...]
Búsqueda de registros usando expresiones regulares
[14]:
import re
name_finished_in_z = [truck_events[0]] + [
row for row in truck_events if re.search("z$", row[1])
]
pprint(name_finished_in_z[:10])
driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]
0 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]
1 23 Adam Diaz 68 59:26.6 Normal -91.32 41.71 23|68|9223370572464809216 [...]
2 23 Adam Diaz 68 59:27.4 Normal -91.47 41.74 23|68|9223370572464808375 [...]
3 23 Adam Diaz 68 59:28.4 Normal -91.63 41.72 23|68|9223370572464807444 [...]
4 23 Adam Diaz 68 59:29.9 Normal -91.78 42.23 23|68|9223370572464805905 [...]
5 23 Adam Diaz 68 59:30.8 Normal -91.63 41.72 23|68|9223370572464804995 [...]
6 23 Adam Diaz 68 59:37.0 Normal -89.91 40.86 23|68|9223370572464798855 [...]
7 23 Adam Diaz 68 59:39.6 Normal -89.91 40.86 23|68|9223370572464796242 [...]
8 23 Adam Diaz 68 59:40.4 Normal -90.29 40.96 23|68|9223370572464795396 [...]
[15]:
name_begins_with_Ma = [truck_events[0]] + [
row for row in truck_events if re.search("^Ma", row[1])
]
pprint(name_begins_with_Ma[:10])
driverId driverName truckId eventTime eventType longitude latitude eventK [...]
0 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]
1 27 Mark Lochbihler 105 59:22.6 Normal -90.41 38.75 27|105|92233705724648132 [...]
2 27 Mark Lochbihler 105 59:25.9 Normal -90.93 38.82 27|105|92233705724648098 [...]
3 27 Mark Lochbihler 105 59:27.7 Normal -91.19 38.83 27|105|92233705724648081 [...]
4 27 Mark Lochbihler 105 59:29.3 Normal -91.56 38.93 27|105|92233705724648065 [...]
5 27 Mark Lochbihler 105 59:35.6 Normal -92.85 38.93 27|105|92233705724648001 [...]
6 27 Mark Lochbihler 105 59:50.9 Normal -93.2 38.98 27|105|92233705724647849 [...]
7 27 Mark Lochbihler 105 59:51.8 Normal -93.01 38.97 27|105|92233705724647839 [...]
8 27 Mark Lochbihler 105 59:53.8 Normal -92.63 38.95 27|105|92233705724647820 [...]
[16]:
name_contains_ch = [truck_events[0]] + [
row for row in truck_events if re.search("ch", row[1])
]
pprint(name_contains_ch[:10])
driverId driverName truckId eventTime eventType longitude latitude eventK [...]
0 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]
1 27 Mark Lochbihler 105 59:22.6 Normal -90.41 38.75 27|105|92233705724648132 [...]
2 16 Tom McCuch 12 59:23.4 Normal -90.29 40.96 16|12|92233705724648123 [...]
3 26 Michael Aube 57 59:25.2 Normal -90.86 38.46 26|57|92233705724648106 [...]
4 16 Tom McCuch 12 59:25.3 Normal -90.7 41.62 16|12|92233705724648105 [...]
5 27 Mark Lochbihler 105 59:25.9 Normal -90.93 38.82 27|105|92233705724648098 [...]
6 26 Michael Aube 57 59:27.0 Normal -91.18 38.22 26|57|92233705724648087 [...]
7 27 Mark Lochbihler 105 59:27.7 Normal -91.19 38.83 27|105|92233705724648081 [...]
8 16 Tom McCuch 12 59:28.8 Normal -91.63 41.72 16|12|92233705724648070 [...]
[17]:
name_not_contains_i = [truck_events[0]] + [
row for row in truck_events if not re.search("i", row[1])
]
pprint(name_not_contains_i[:10])
driverId driverName truckId eventTime eventType longitude latitude eventKe [...]
0 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|922337057246481410 [...]
1 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|922337057246481348 [...]
2 32 Ryan Templeton 42 59:22.5 Normal -90.37 35.21 32|42|922337057246481329 [...]
3 16 Tom McCuch 12 59:23.4 Normal -90.29 40.96 16|12|922337057246481239 [...]
4 22 Nadeem Asghar 87 59:24.2 Normal -90.94 35.03 22|87|922337057246481165 [...]
5 32 Ryan Templeton 42 59:24.2 Normal -90.94 35.03 32|42|922337057246481159 [...]
6 22 Nadeem Asghar 87 59:25.0 Normal -91.14 34.96 22|87|922337057246481080 [...]
7 16 Tom McCuch 12 59:25.3 Normal -90.7 41.62 16|12|922337057246481053 [...]
8 21 Jeff Markham 109 59:25.7 Normal -94.35 38.33 21|109|922337057246481009 [...]
[18]:
name_not_begins_with_MN = [truck_events[0]] + [
row for row in truck_events[1:] if not re.search("^[MN]", row[1])
]
pprint(name_not_begins_with_MN[:10])
driverId driverName truckId eventTime eventType longitude latitude eventKey [...]
0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 [...]
1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|9223370572464814089 [...]
2 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|9223370572464814123 [...]
3 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]
4 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|9223370572464813355 [...]
5 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|9223370572464813344 [...]
6 32 Ryan Templeton 42 59:22.5 Normal -90.37 35.21 32|42|9223370572464813296 [...]
7 17 Eric Mizell 15 59:23.2 Normal -90.55 38.81 17|15|9223370572464812585 [...]
8 14 Adis Cesir 25 59:23.3 Normal -94.31 37.31 14|25|9223370572464812526 [...]
Borrado de registros duplicados
[19]:
truck_events_as_strings = [",".join(row) for row in truck_events[1:]]
unique_truck_events = list(set(truck_events_as_strings))
unique_truck_events = [row.split(",") for row in unique_truck_events]
unique_truck_events = [truck_events[0]] + unique_truck_events
pprint(unique_truck_events[:10])
driverId driverName truckId eventTime eventType longitude latitude event [...]
0 15 Rohit Bakshi 92 36:19.8 Normal -91.14 34.96 15|92|9223370572419395 [...]
1 31 Rommel Garcia 86 37:55.2 Normal -91.93 34.81 31|86|9223370572419300 [...]
2 14 Adis Cesir 56 36:35.7 Normal -90.29 40.96 14|56|9223370572419380 [...]
3 29 Teddy Choi 68 36:18.5 Normal -95.56 35.97 29|68|9223370572419397 [...]
4 10 George Vetticaden 39 12:03.0 Normal -93.69 37.16 10|39|9223370571956452 [...]
5 27 Mark Lochbihler 60 36:51.2 Normal -91.19 38.83 27|60|9223370572419364 [...]
6 11 Jamie Engesser 39 38:04.8 Normal -90.2 41.59 11|39|9223370572419290 [...]
7 14 Adis Cesir 18 11:46.7 Normal -91.47 41.74 14|18|9223370571956469 [...]
8 30 Dan Rice 95 09:32.7 Normal -91.59 41.7 30|95|9223370571956603 [...]
Valores únicos por campo
[20]:
driverNames = sorted({row[1] for row in truck_events[1:]})
driverNames
[20]:
['Adam Diaz',
'Adis Cesir',
'Ajay Singh',
'Chris Harris',
'Dan Rice',
'Don Hilborn',
'Eric Mizell',
'George Vetticaden',
'Grant Liu',
'Jamie Engesser',
'Jean-Philippe Player',
'Jeff Markham',
'Joe Niemiec',
'Mark Lochbihler',
'Michael Aube',
'Nadeem Asghar',
'Olivier Renault',
'Paul Codding',
'Rohit Bakshi',
'Rommel Garcia',
'Ryan Templeton',
'Teddy Choi',
'Tom McCuch']
Ordenamiento de registros con base en un campo
[21]:
from operator import itemgetter
sorted_truck_events = [truck_events[0]] + sorted(
[row for row in truck_events[1:]], key=itemgetter(1)
)
pprint(sorted_truck_events[:10])
driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]
0 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]
1 23 Adam Diaz 68 59:26.6 Normal -91.32 41.71 23|68|9223370572464809216 [...]
2 23 Adam Diaz 68 59:27.4 Normal -91.47 41.74 23|68|9223370572464808375 [...]
3 23 Adam Diaz 68 59:28.4 Normal -91.63 41.72 23|68|9223370572464807444 [...]
4 23 Adam Diaz 68 59:29.9 Normal -91.78 42.23 23|68|9223370572464805905 [...]
5 23 Adam Diaz 68 59:30.8 Normal -91.63 41.72 23|68|9223370572464804995 [...]
6 23 Adam Diaz 68 59:37.0 Normal -89.91 40.86 23|68|9223370572464798855 [...]
7 23 Adam Diaz 68 59:39.6 Normal -89.91 40.86 23|68|9223370572464796242 [...]
8 23 Adam Diaz 68 59:40.4 Normal -90.29 40.96 23|68|9223370572464795396 [...]
Obtención de un subconjunto de columnas
[22]:
#
# Seleccione las columnas "driverId", "eventTime", "eventType"
# de la variable truck_events_subset
#
column_indexes = [
i_col
for i_col, colname in enumerate(truck_events_subset[0])
if colname in ["driverId", "eventTime", "eventType"]
]
specific_columns = [
[col for i_col, col in enumerate(row) if i_col in column_indexes]
for row in truck_events_subset
]
pprint(specific_columns)
driverId eventTime eventType
0 14 59:21.4 Normal
1 18 59:21.7 Normal
2 27 59:21.7 Normal
3 11 59:21.7 Normal
4 22 59:21.7 Normal
5 22 59:22.3 Normal
6 23 59:22.4 Normal
7 11 59:22.5 Normal
8 20 59:22.5 Normal
Escritura al disco como una tabla
[23]:
specific_columns = [",".join(row) for row in specific_columns]
specific_columns = "\n".join(specific_columns)
specific_columns
[23]:
'driverId,eventTime,eventType\n14,59:21.4,Normal\n18,59:21.7,Normal\n27,59:21.7,Normal\n11,59:21.7,Normal\n22,59:21.7,Normal\n22,59:22.3,Normal\n23,59:22.4,Normal\n11,59:22.5,Normal\n20,59:22.5,Normal'
[24]:
with open("/tmp/specific_columns.csv", "w") as file:
print(specific_columns, file=file)
!cat /tmp/specific_columns.csv
driverId,eventTime,eventType
14,59:21.4,Normal
18,59:21.7,Normal
27,59:21.7,Normal
11,59:21.7,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
11,59:22.5,Normal
20,59:22.5,Normal
[25]:
!rm /tmp/*