Procesamiento básico de datos usando Python — 12:22 min

12:22 min | Última modificación: Octubre 12, 2021

En este tutorial se explica como realizar el procedimiento básico de datos usando Python.

Al finalizar este tutorial, usted estará en capacidad de:

Descargar archivos de datos desde internet y cargarlos como lista.
Usar la función groupby de la librería itertools.
Unir dos conjuntos de datos usando un campo clave.
Ordenar un conjunto de datos.
Búscar los registros que contienen el valor máximo o mínimo de un campo.
Escribir al disco duro los resultados.

Descarga de datos

[1]:

url_drivers = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv"
!wget --quiet {url_drivers} -P /tmp/

url_timesheet = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv"
!wget --quiet {url_timesheet} -P /tmp/

!ls -1 /tmp/*

/tmp/drivers.csv
/tmp/timesheet.csv

Impresion

[2]:

def pprint(data):
    #
    def get_format_string(data):
        format_string = []
        for i_col in range(len(data[0])):
            lengths = [len(str(row[i_col])) for row in data]
            max_length = max(lengths)
            #
            # Por ejemplo: "{:>10s}"
            #
            format_string.append("{:>" + str(max_length) + "s}")
        return format_string

    #
    def print_data(format_string, data):
        format_string = get_format_string(data)
        for index, row in enumerate(data):
            text = "    " if index == 0 else "{:2d}  ".format(index - 1)
            for fmt, value in zip(format_string, row):
                text += fmt.format(str(value)) + " "
            if len(text) >= 106:
                text = text[:100] + " [...]"
            print(text)

    #
    format_string = get_format_string(data)
    print_data(format_string, data)

Creación de la tabla drivers

[3]:

with open("/tmp/drivers.csv", "r") as file:
    drivers = file.readlines()

drivers = [row.replace("\n", "") for row in drivers]
drivers = [row.split(",") for row in drivers]
drivers = [row[:2] for row in drivers]
pprint(drivers[0:10])

    driverId              name
 0        10 George Vetticaden
 1        11    Jamie Engesser
 2        12       Paul Coddin
 3        13       Joe Niemiec
 4        14        Adis Cesir
 5        15      Rohit Bakshi
 6        16        Tom McCuch
 7        17       Eric Mizell
 8        18         Grant Liu

Creación de la tabla timesheet

[4]:

with open("/tmp/timesheet.csv", "r") as file:
    timesheet = file.readlines()

timesheet = [row.replace("\n", "") for row in timesheet]
timesheet = [row.split(",") for row in timesheet]
pprint(timesheet[:10])

    driverId week hours-logged miles-logged
 0        10    1           70         3300
 1        10    2           70         3300
 2        10    3           60         2800
 3        10    4           70         3100
 4        10    5           70         3200
 5        10    6           70         3300
 6        10    7           70         3000
 7        10    8           70         3300
 8        10    9           70         3200

[5]:

#
# Typecast de datos
#
timesheet = [
    [int(field) if i_row > 0 else field for field in row]
    for i_row, row in enumerate(timesheet)
]
pprint(timesheet[:10])

    driverId week hours-logged miles-logged
 0        10    1           70         3300
 1        10    2           70         3300
 2        10    3           60         2800
 3        10    4           70         3100
 4        10    5           70         3200
 5        10    6           70         3300
 6        10    7           70         3000
 7        10    8           70         3300
 8        10    9           70         3200

Cantidad de horas y millas de cada conductor por año

[6]:

import itertools
from operator import itemgetter

for i_key, (key, group) in enumerate(
    itertools.groupby(
        timesheet[1:],
        itemgetter(0),
    )
):
    print(key)

    for i_grp, grp in enumerate(group):
        print("   ", grp)
        if i_grp > 3:
            print("    ...")
            break

    if i_key > 2:
        print("...")
        break

10
    [10, 1, 70, 3300]
    [10, 2, 70, 3300]
    [10, 3, 60, 2800]
    [10, 4, 70, 3100]
    [10, 5, 70, 3200]
    ...
11
    [11, 1, 50, 3000]
    [11, 2, 83, 4000]
    [11, 3, 80, 4000]
    [11, 4, 85, 4000]
    [11, 5, 90, 4100]
    ...
12
    [12, 1, 49, 2783]
    [12, 2, 50, 2505]
    [12, 3, 51, 2577]
    [12, 4, 54, 2743]
    [12, 5, 47, 2791]
    ...
13
    [13, 1, 49, 2643]
    [13, 2, 56, 2553]
    [13, 3, 60, 2539]
    [13, 4, 55, 2553]
    [13, 5, 45, 2762]
    ...
...

[7]:

timesheet_grouped_by_driverId = {
    driverId: list(group)
    for driverId, group in itertools.groupby(
        timesheet[1:],
        key=itemgetter(0),
    )
}

sum_timesheet = [
    [
        driverId,
        sum([row[2] for row in timesheet_grouped_by_driverId[driverId]]),
        sum([row[3] for row in timesheet_grouped_by_driverId[driverId]]),
    ]
    for driverId in timesheet_grouped_by_driverId.keys()
]

sum_timesheet = [["driverId", "hours-logged", "miles-logged"]] + sum_timesheet

pprint(sum_timesheet)

    driverId hours-logged miles-logged
 0        10         3232       147150
 1        11         3642       179300
 2        12         2639       135962
 3        13         2727       134126
 4        14         2781       136624
 5        15         2734       138750
 6        16         2746       137205
 7        17         2701       135992
 8        18         2654       137834
 9        19         2738       137968
10        20         2644       134564
11        21         2751       138719
12        22         2733       137550
13        23         2750       137980
14        24         2647       134461
15        25         2723       139180
16        26         2730       137530
17        27         2771       137922
18        28         2723       137469
19        29         2760       138255
20        30         2773       137473
21        31         2704       137057
22        32         2736       137422
23        33         2759       139285
24        34         2811       137728
25        35         2728       138727
26        36         2795       138025
27        37         2694       137223
28        38         2760       137464
29        39         2745       138788
30        40         2700       136931
31        41         2723       138407
32        42         2697       136673
33        43         2750       136993

Unión de las tablas

[8]:

summary = [
    row_drivers + row_timesheet[1:]
    for row_drivers in drivers[1:]
    for row_timesheet in sum_timesheet[1:]
    if row_drivers[0] == str(row_timesheet[0])
]

summary = [["driverId", "name", "hours-logged", "miles-logged"]] + summary
pprint(summary)

    driverId                name hours-logged miles-logged
 0        10   George Vetticaden         3232       147150
 1        11      Jamie Engesser         3642       179300
 2        12         Paul Coddin         2639       135962
 3        13         Joe Niemiec         2727       134126
 4        14          Adis Cesir         2781       136624
 5        15        Rohit Bakshi         2734       138750
 6        16          Tom McCuch         2746       137205
 7        17         Eric Mizell         2701       135992
 8        18           Grant Liu         2654       137834
 9        19          Ajay Singh         2738       137968
10        20        Chris Harris         2644       134564
11        21        Jeff Markham         2751       138719
12        22       Nadeem Asghar         2733       137550
13        23           Adam Diaz         2750       137980
14        24         Don Hilborn         2647       134461
15        25 Jean-Philippe Playe         2723       139180
16        26        Michael Aube         2730       137530
17        27     Mark Lochbihler         2771       137922
18        28     Olivier Renault         2723       137469
19        29          Teddy Choi         2760       138255
20        30            Dan Rice         2773       137473
21        31       Rommel Garcia         2704       137057
22        32      Ryan Templeton         2736       137422
23        33   Sridhara Sabbella         2759       139285
24        34        Frank Romano         2811       137728
25        35         Emil Siemes         2728       138727
26        36       Andrew Grande         2795       138025
27        37           Wes Floyd         2694       137223
28        38          Scott Shaw         2760       137464
29        39        David Kaiser         2745       138788
30        40    Nicolas Maillard         2700       136931
31        41       Greg Phillips         2723       138407
32        42     Randy Gelhausen         2697       136673
33        43         Dave Patton         2750       136993

Ordenamiento de la tabla

[9]:

from operator import itemgetter

sorted_summary = [summary[0]] + sorted([row for row in summary[1:]], key=itemgetter(2))
pprint(sorted_summary)

    driverId                name hours-logged miles-logged
 0        12         Paul Coddin         2639       135962
 1        20        Chris Harris         2644       134564
 2        24         Don Hilborn         2647       134461
 3        18           Grant Liu         2654       137834
 4        37           Wes Floyd         2694       137223
 5        42     Randy Gelhausen         2697       136673
 6        40    Nicolas Maillard         2700       136931
 7        17         Eric Mizell         2701       135992
 8        31       Rommel Garcia         2704       137057
 9        25 Jean-Philippe Playe         2723       139180
10        28     Olivier Renault         2723       137469
11        41       Greg Phillips         2723       138407
12        13         Joe Niemiec         2727       134126
13        35         Emil Siemes         2728       138727
14        26        Michael Aube         2730       137530
15        22       Nadeem Asghar         2733       137550
16        15        Rohit Bakshi         2734       138750
17        32      Ryan Templeton         2736       137422
18        19          Ajay Singh         2738       137968
19        39        David Kaiser         2745       138788
20        16          Tom McCuch         2746       137205
21        23           Adam Diaz         2750       137980
22        43         Dave Patton         2750       136993
23        21        Jeff Markham         2751       138719
24        33   Sridhara Sabbella         2759       139285
25        29          Teddy Choi         2760       138255
26        38          Scott Shaw         2760       137464
27        27     Mark Lochbihler         2771       137922
28        30            Dan Rice         2773       137473
29        14          Adis Cesir         2781       136624
30        36       Andrew Grande         2795       138025
31        34        Frank Romano         2811       137728
32        10   George Vetticaden         3232       147150
33        11      Jamie Engesser         3642       179300

Búsqueda del máximo o el mínimo

[10]:

[
    row
    for row in sorted_summary[1:]
    if row[2] == max(aux_row[2] for aux_row in sorted_summary[1:])
]

[10]:

[['11', 'Jamie Engesser', 3642, 179300]]

Almacenamiento de los resultados

[11]:

summary = [[str(field) for field in row] for row in summary]
summary = [",".join(row) for row in summary]
summary = "\n".join(summary)

with open("/tmp/summary.csv", "w") as file:
    print(summary, file=file)

!cat /tmp/summary.csv

driverId,name,hours-logged,miles-logged
10,George Vetticaden,3232,147150
11,Jamie Engesser,3642,179300
12,Paul Coddin,2639,135962
13,Joe Niemiec,2727,134126
14,Adis Cesir,2781,136624
15,Rohit Bakshi,2734,138750
16,Tom McCuch,2746,137205
17,Eric Mizell,2701,135992
18,Grant Liu,2654,137834
19,Ajay Singh,2738,137968
20,Chris Harris,2644,134564
21,Jeff Markham,2751,138719
22,Nadeem Asghar,2733,137550
23,Adam Diaz,2750,137980
24,Don Hilborn,2647,134461
25,Jean-Philippe Playe,2723,139180
26,Michael Aube,2730,137530
27,Mark Lochbihler,2771,137922
28,Olivier Renault,2723,137469
29,Teddy Choi,2760,138255
30,Dan Rice,2773,137473
31,Rommel Garcia,2704,137057
32,Ryan Templeton,2736,137422
33,Sridhara Sabbella,2759,139285
34,Frank Romano,2811,137728
35,Emil Siemes,2728,138727
36,Andrew Grande,2795,138025
37,Wes Floyd,2694,137223
38,Scott Shaw,2760,137464
39,David Kaiser,2745,138788
40,Nicolas Maillard,2700,136931
41,Greg Phillips,2723,138407
42,Randy Gelhausen,2697,136673
43,Dave Patton,2750,136993

[12]:

!rm /tmp/*