Operaciones básicas sobre archivos de datos usando Python — 20:33 min

  • 20:33 min | Última modificación: Octubre 12, 2021

En este video se presenta un ejemplo práctico de la manipulación de archivos de datos en texto usando listas. Si bien, resulta mucho más facil realizar esta tarea usando librerías especializaxad en Python, el objetivo aquí es ejemplificar el uso de las estructuras de datos disponibles en Python puro.

Al finalizar este tutorial, usted estará en capacidad de:

  • Descargar archivos de datos de internet y convertirlos a una lista de campos.

  • Visualizar los datos como una tabla.

  • Reorganizar las columnas de la tabla.

  • Obtener un subconjunto de registros.

  • Filtrar los registros.

  • Buscar registros.

  • Obtener los valores únicos por campo.

  • Obtener un subcojunto de las columnas.

  • Escribir los resultados como una tabla a un archivo.

Carga de datos de los eventos de los conductores

[1]:
#
# Se descarga el archivo directamente del repo al disco duro.
#
url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv"
!wget --quiet {url} -P /tmp/

#
# Visualiza el contenido del directorio
#
!ls -1 /tmp/*
/tmp/data.csv
/tmp/demo.txt
/tmp/truck_event_text_partition.csv
/tmp/tweets.csv
/tmp/tweets.csv.1
/tmp/tweets.csv.2
/tmp/tweets.csv.3
/tmp/tweets.csv.4
[2]:
with open("/tmp/truck_event_text_partition.csv", "r") as file:
    truck_events = file.readlines()

#
# Cantidad de filas del archivo incluyendo la cabecera.
#
len(truck_events)
[2]:
17076
[3]:
#
# Visualización del contenido
#
truck_events[0:2]
[3]:
['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate\n',
 '14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22\n']
[4]:
#
# Limpieza
#
truck_events = [line.replace("\n", "") for line in truck_events]
[5]:
#
# Visualización del contenido
#
truck_events[0:2]
[5]:
['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate',
 '14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22']
[6]:
#
# Conversión de los strings a listas
#
truck_events = [line.split(",") for line in truck_events]
[7]:
#
# Nombres de las columnas
#
truck_events[0]
[7]:
['driverId',
 'truckId',
 'eventTime',
 'eventType',
 'longitude',
 'latitude',
 'eventKey',
 'CorrelationId',
 'driverName',
 'routeId',
 'routeName',
 'eventDate']

Visualización de datos

[8]:
#
# La visualización de datos como una lista de listas
# resulta inapropiada
#
truck_events[:2]
[8]:
[['driverId',
  'truckId',
  'eventTime',
  'eventType',
  'longitude',
  'latitude',
  'eventKey',
  'CorrelationId',
  'driverName',
  'routeId',
  'routeName',
  'eventDate'],
 ['14',
  '25',
  '59:21.4',
  'Normal',
  '-94.58',
  '37.03',
  '14|25|9223370572464814373',
  '3.66E+18',
  'Adis Cesir',
  '160405074',
  'Joplin to Kansas City Route 2',
  '2016-05-27-22']]
[9]:
def pprint(data):
    #
    def get_format_string(data):
        format_string = []
        for i_col in range(len(data[0])):
            lengths = [len(str(row[i_col])) for row in data]
            max_length = max(lengths)
            #
            # Por ejemplo: "{:>10s}"
            #
            format_string.append("{:>" + str(max_length) + "s}")
        return format_string

    #
    def print_data(format_string, data):
        format_string = get_format_string(data)
        for index, row in enumerate(data):
            text = "    " if index == 0 else "{:2d}  ".format(index - 1)
            for fmt, value in zip(format_string, row):
                text += fmt.format(str(value)) + " "
            if len(text) >= 106:
                text = text[:100] + " [...]"
            print(text)

    #
    format_string = get_format_string(data)
    print_data(format_string, data)
[10]:
pprint(truck_events[:5])
    driverId truckId eventTime eventType longitude latitude                   eventKey CorrelationId [...]
 0        14      25   59:21.4    Normal    -94.58    37.03  14|25|9223370572464814373      3.66E+18 [...]
 1        18      16   59:21.7    Normal    -89.66    39.78  18|16|9223370572464814089      3.66E+18 [...]
 2        27     105   59:21.7    Normal    -90.21    38.65 27|105|9223370572464814070      3.66E+18 [...]
 3        11      74   59:21.7    Normal     -90.2    38.65  11|74|9223370572464814123      3.66E+18 [...]

Reorganización de las columnas

[11]:
truck_events = [
    [row[0], row[8]] + [row[index] for index in range(1, 12) if index != 8]
    for row in truck_events
]
pprint(truck_events[:10])
    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        14      Adis Cesir      25   59:21.4    Normal    -94.58    37.03  14|25|92233705724648143 [...]
 1        18       Grant Liu      16   59:21.7    Normal    -89.66    39.78  18|16|92233705724648140 [...]
 2        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 3        11  Jamie Engesser      74   59:21.7    Normal     -90.2    38.65  11|74|92233705724648141 [...]
 4        22   Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|92233705724648141 [...]
 5        22   Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|92233705724648134 [...]
 6        23       Adam Diaz      68   59:22.4    Normal    -89.91    40.86  23|68|92233705724648134 [...]
 7        11  Jamie Engesser      74   59:22.5    Normal    -89.74     39.1  11|74|92233705724648133 [...]
 8        20    Chris Harris      41   59:22.5    Normal    -93.36    41.69  20|41|92233705724648133 [...]

Obtención de un subconjunto de registros

[12]:
truck_events_subset = truck_events[0:10]
pprint(truck_events_subset)
    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        14      Adis Cesir      25   59:21.4    Normal    -94.58    37.03  14|25|92233705724648143 [...]
 1        18       Grant Liu      16   59:21.7    Normal    -89.66    39.78  18|16|92233705724648140 [...]
 2        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 3        11  Jamie Engesser      74   59:21.7    Normal     -90.2    38.65  11|74|92233705724648141 [...]
 4        22   Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|92233705724648141 [...]
 5        22   Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|92233705724648134 [...]
 6        23       Adam Diaz      68   59:22.4    Normal    -89.91    40.86  23|68|92233705724648134 [...]
 7        11  Jamie Engesser      74   59:22.5    Normal    -89.74     39.1  11|74|92233705724648133 [...]
 8        20    Chris Harris      41   59:22.5    Normal    -93.36    41.69  20|41|92233705724648133 [...]

Filtrado de registros

[13]:
truck_events_driverId_14 = [truck_events[0]] + [
    row for row in truck_events if row[0] == "14"
]
pprint(truck_events_driverId_14[:10])
    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        14 Adis Cesir      25   59:21.4    Normal    -94.58    37.03 14|25|9223370572464814373     [...]
 1        14 Adis Cesir      25   59:23.3    Normal    -94.31    37.31 14|25|9223370572464812526     [...]
 2        14 Adis Cesir      25   59:24.2    Normal     -94.3    37.66 14|25|9223370572464811655     [...]
 3        14 Adis Cesir      25   59:34.0    Normal     -94.3    37.66 14|25|9223370572464801796     [...]
 4        14 Adis Cesir      25   59:35.8    Normal    -94.46    37.16 14|25|9223370572464800006     [...]
 5        14 Adis Cesir      25   59:53.3    Normal    -94.58    37.03 14|25|9223370572464782555     [...]
 6        14 Adis Cesir      25   59:54.0    Normal    -94.46    37.16 14|25|9223370572464781805     [...]
 7        14 Adis Cesir      25   59:57.5    Normal    -94.35    38.33 14|25|9223370572464778335     [...]
 8        14 Adis Cesir      25   59:59.2    Normal    -94.38    38.99 14|25|9223370572464776636     [...]

Búsqueda de registros usando expresiones regulares

[14]:
import re

name_finished_in_z = [truck_events[0]] + [
    row for row in truck_events if re.search("z$", row[1])
]
pprint(name_finished_in_z[:10])
    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        23  Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450     [...]
 1        23  Adam Diaz      68   59:26.6    Normal    -91.32    41.71 23|68|9223370572464809216     [...]
 2        23  Adam Diaz      68   59:27.4    Normal    -91.47    41.74 23|68|9223370572464808375     [...]
 3        23  Adam Diaz      68   59:28.4    Normal    -91.63    41.72 23|68|9223370572464807444     [...]
 4        23  Adam Diaz      68   59:29.9    Normal    -91.78    42.23 23|68|9223370572464805905     [...]
 5        23  Adam Diaz      68   59:30.8    Normal    -91.63    41.72 23|68|9223370572464804995     [...]
 6        23  Adam Diaz      68   59:37.0    Normal    -89.91    40.86 23|68|9223370572464798855     [...]
 7        23  Adam Diaz      68   59:39.6    Normal    -89.91    40.86 23|68|9223370572464796242     [...]
 8        23  Adam Diaz      68   59:40.4    Normal    -90.29    40.96 23|68|9223370572464795396     [...]
[15]:
name_begins_with_Ma = [truck_events[0]] + [
    row for row in truck_events if re.search("^Ma", row[1])
]
pprint(name_begins_with_Ma[:10])
    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 1        27 Mark Lochbihler     105   59:22.6    Normal    -90.41    38.75 27|105|92233705724648132 [...]
 2        27 Mark Lochbihler     105   59:25.9    Normal    -90.93    38.82 27|105|92233705724648098 [...]
 3        27 Mark Lochbihler     105   59:27.7    Normal    -91.19    38.83 27|105|92233705724648081 [...]
 4        27 Mark Lochbihler     105   59:29.3    Normal    -91.56    38.93 27|105|92233705724648065 [...]
 5        27 Mark Lochbihler     105   59:35.6    Normal    -92.85    38.93 27|105|92233705724648001 [...]
 6        27 Mark Lochbihler     105   59:50.9    Normal     -93.2    38.98 27|105|92233705724647849 [...]
 7        27 Mark Lochbihler     105   59:51.8    Normal    -93.01    38.97 27|105|92233705724647839 [...]
 8        27 Mark Lochbihler     105   59:53.8    Normal    -92.63    38.95 27|105|92233705724647820 [...]
[16]:
name_contains_ch = [truck_events[0]] + [
    row for row in truck_events if re.search("ch", row[1])
]
pprint(name_contains_ch[:10])
    driverId      driverName truckId eventTime eventType longitude latitude                   eventK [...]
 0        27 Mark Lochbihler     105   59:21.7    Normal    -90.21    38.65 27|105|92233705724648140 [...]
 1        27 Mark Lochbihler     105   59:22.6    Normal    -90.41    38.75 27|105|92233705724648132 [...]
 2        16      Tom McCuch      12   59:23.4    Normal    -90.29    40.96  16|12|92233705724648123 [...]
 3        26    Michael Aube      57   59:25.2    Normal    -90.86    38.46  26|57|92233705724648106 [...]
 4        16      Tom McCuch      12   59:25.3    Normal     -90.7    41.62  16|12|92233705724648105 [...]
 5        27 Mark Lochbihler     105   59:25.9    Normal    -90.93    38.82 27|105|92233705724648098 [...]
 6        26    Michael Aube      57   59:27.0    Normal    -91.18    38.22  26|57|92233705724648087 [...]
 7        27 Mark Lochbihler     105   59:27.7    Normal    -91.19    38.83 27|105|92233705724648081 [...]
 8        16      Tom McCuch      12   59:28.8    Normal    -91.63    41.72  16|12|92233705724648070 [...]
[17]:
name_not_contains_i = [truck_events[0]] + [
    row for row in truck_events if not re.search("i", row[1])
]
pprint(name_not_contains_i[:10])
    driverId     driverName truckId eventTime eventType longitude latitude                   eventKe [...]
 0        22  Nadeem Asghar      87   59:21.7    Normal    -90.04    35.19  22|87|922337057246481410 [...]
 1        22  Nadeem Asghar      87   59:22.3    Normal    -90.37    35.21  22|87|922337057246481348 [...]
 2        32 Ryan Templeton      42   59:22.5    Normal    -90.37    35.21  32|42|922337057246481329 [...]
 3        16     Tom McCuch      12   59:23.4    Normal    -90.29    40.96  16|12|922337057246481239 [...]
 4        22  Nadeem Asghar      87   59:24.2    Normal    -90.94    35.03  22|87|922337057246481165 [...]
 5        32 Ryan Templeton      42   59:24.2    Normal    -90.94    35.03  32|42|922337057246481159 [...]
 6        22  Nadeem Asghar      87   59:25.0    Normal    -91.14    34.96  22|87|922337057246481080 [...]
 7        16     Tom McCuch      12   59:25.3    Normal     -90.7    41.62  16|12|922337057246481053 [...]
 8        21   Jeff Markham     109   59:25.7    Normal    -94.35    38.33 21|109|922337057246481009 [...]
[18]:
name_not_begins_with_MN = [truck_events[0]] + [
    row for row in truck_events[1:] if not re.search("^[MN]", row[1])
]
pprint(name_not_begins_with_MN[:10])
    driverId     driverName truckId eventTime eventType longitude latitude                  eventKey [...]
 0        14     Adis Cesir      25   59:21.4    Normal    -94.58    37.03 14|25|9223370572464814373 [...]
 1        18      Grant Liu      16   59:21.7    Normal    -89.66    39.78 18|16|9223370572464814089 [...]
 2        11 Jamie Engesser      74   59:21.7    Normal     -90.2    38.65 11|74|9223370572464814123 [...]
 3        23      Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450 [...]
 4        11 Jamie Engesser      74   59:22.5    Normal    -89.74     39.1 11|74|9223370572464813355 [...]
 5        20   Chris Harris      41   59:22.5    Normal    -93.36    41.69 20|41|9223370572464813344 [...]
 6        32 Ryan Templeton      42   59:22.5    Normal    -90.37    35.21 32|42|9223370572464813296 [...]
 7        17    Eric Mizell      15   59:23.2    Normal    -90.55    38.81 17|15|9223370572464812585 [...]
 8        14     Adis Cesir      25   59:23.3    Normal    -94.31    37.31 14|25|9223370572464812526 [...]

Borrado de registros duplicados

[19]:
truck_events_as_strings = [",".join(row) for row in truck_events[1:]]
unique_truck_events = list(set(truck_events_as_strings))
unique_truck_events = [row.split(",") for row in unique_truck_events]
unique_truck_events = [truck_events[0]] + unique_truck_events
pprint(unique_truck_events[:10])
    driverId        driverName truckId eventTime eventType longitude latitude                  event [...]
 0        15      Rohit Bakshi      92   36:19.8    Normal    -91.14    34.96 15|92|9223370572419395 [...]
 1        31     Rommel Garcia      86   37:55.2    Normal    -91.93    34.81 31|86|9223370572419300 [...]
 2        14        Adis Cesir      56   36:35.7    Normal    -90.29    40.96 14|56|9223370572419380 [...]
 3        29        Teddy Choi      68   36:18.5    Normal    -95.56    35.97 29|68|9223370572419397 [...]
 4        10 George Vetticaden      39   12:03.0    Normal    -93.69    37.16 10|39|9223370571956452 [...]
 5        27   Mark Lochbihler      60   36:51.2    Normal    -91.19    38.83 27|60|9223370572419364 [...]
 6        11    Jamie Engesser      39   38:04.8    Normal     -90.2    41.59 11|39|9223370572419290 [...]
 7        14        Adis Cesir      18   11:46.7    Normal    -91.47    41.74 14|18|9223370571956469 [...]
 8        30          Dan Rice      95   09:32.7    Normal    -91.59     41.7 30|95|9223370571956603 [...]

Valores únicos por campo

[20]:
driverNames = sorted({row[1] for row in truck_events[1:]})
driverNames
[20]:
['Adam Diaz',
 'Adis Cesir',
 'Ajay Singh',
 'Chris Harris',
 'Dan Rice',
 'Don Hilborn',
 'Eric Mizell',
 'George Vetticaden',
 'Grant Liu',
 'Jamie Engesser',
 'Jean-Philippe Player',
 'Jeff Markham',
 'Joe Niemiec',
 'Mark Lochbihler',
 'Michael Aube',
 'Nadeem Asghar',
 'Olivier Renault',
 'Paul Codding',
 'Rohit Bakshi',
 'Rommel Garcia',
 'Ryan Templeton',
 'Teddy Choi',
 'Tom McCuch']

Ordenamiento de registros con base en un campo

[21]:
from operator import itemgetter

sorted_truck_events = [truck_events[0]] + sorted(
    [row for row in truck_events[1:]], key=itemgetter(1)
)
pprint(sorted_truck_events[:10])
    driverId driverName truckId eventTime eventType longitude latitude                  eventKey Cor [...]
 0        23  Adam Diaz      68   59:22.4    Normal    -89.91    40.86 23|68|9223370572464813450     [...]
 1        23  Adam Diaz      68   59:26.6    Normal    -91.32    41.71 23|68|9223370572464809216     [...]
 2        23  Adam Diaz      68   59:27.4    Normal    -91.47    41.74 23|68|9223370572464808375     [...]
 3        23  Adam Diaz      68   59:28.4    Normal    -91.63    41.72 23|68|9223370572464807444     [...]
 4        23  Adam Diaz      68   59:29.9    Normal    -91.78    42.23 23|68|9223370572464805905     [...]
 5        23  Adam Diaz      68   59:30.8    Normal    -91.63    41.72 23|68|9223370572464804995     [...]
 6        23  Adam Diaz      68   59:37.0    Normal    -89.91    40.86 23|68|9223370572464798855     [...]
 7        23  Adam Diaz      68   59:39.6    Normal    -89.91    40.86 23|68|9223370572464796242     [...]
 8        23  Adam Diaz      68   59:40.4    Normal    -90.29    40.96 23|68|9223370572464795396     [...]

Obtención de un subconjunto de columnas

[22]:
#
# Seleccione las columnas "driverId", "eventTime", "eventType"
# de la variable truck_events_subset
#

column_indexes = [
    i_col
    for i_col, colname in enumerate(truck_events_subset[0])
    if colname in ["driverId", "eventTime", "eventType"]
]

specific_columns = [
    [col for i_col, col in enumerate(row) if i_col in column_indexes]
    for row in truck_events_subset
]

pprint(specific_columns)
    driverId eventTime eventType
 0        14   59:21.4    Normal
 1        18   59:21.7    Normal
 2        27   59:21.7    Normal
 3        11   59:21.7    Normal
 4        22   59:21.7    Normal
 5        22   59:22.3    Normal
 6        23   59:22.4    Normal
 7        11   59:22.5    Normal
 8        20   59:22.5    Normal

Escritura al disco como una tabla

[23]:
specific_columns = [",".join(row) for row in specific_columns]
specific_columns = "\n".join(specific_columns)
specific_columns
[23]:
'driverId,eventTime,eventType\n14,59:21.4,Normal\n18,59:21.7,Normal\n27,59:21.7,Normal\n11,59:21.7,Normal\n22,59:21.7,Normal\n22,59:22.3,Normal\n23,59:22.4,Normal\n11,59:22.5,Normal\n20,59:22.5,Normal'
[24]:
with open("/tmp/specific_columns.csv", "w") as file:
    print(specific_columns, file=file)

!cat /tmp/specific_columns.csv
driverId,eventTime,eventType
14,59:21.4,Normal
18,59:21.7,Normal
27,59:21.7,Normal
11,59:21.7,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
11,59:22.5,Normal
20,59:22.5,Normal
[25]:
!rm /tmp/*