Procesamiento básico de datos usando Python — 12:22 min
12:22 min | Última modificación: Octubre 12, 2021
En este tutorial se explica como realizar el procedimiento básico de datos usando Python.
Al finalizar este tutorial, usted estará en capacidad de:
Descargar archivos de datos desde internet y cargarlos como lista.
Usar la función groupby de la librería itertools.
Unir dos conjuntos de datos usando un campo clave.
Ordenar un conjunto de datos.
Búscar los registros que contienen el valor máximo o mínimo de un campo.
Escribir al disco duro los resultados.
Descarga de datos
[1]:
url_drivers = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv"
!wget --quiet {url_drivers} -P /tmp/
url_timesheet = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv"
!wget --quiet {url_timesheet} -P /tmp/
!ls -1 /tmp/*
/tmp/drivers.csv
/tmp/timesheet.csv
Impresion
[2]:
def pprint(data):
#
def get_format_string(data):
format_string = []
for i_col in range(len(data[0])):
lengths = [len(str(row[i_col])) for row in data]
max_length = max(lengths)
#
# Por ejemplo: "{:>10s}"
#
format_string.append("{:>" + str(max_length) + "s}")
return format_string
#
def print_data(format_string, data):
format_string = get_format_string(data)
for index, row in enumerate(data):
text = " " if index == 0 else "{:2d} ".format(index - 1)
for fmt, value in zip(format_string, row):
text += fmt.format(str(value)) + " "
if len(text) >= 106:
text = text[:100] + " [...]"
print(text)
#
format_string = get_format_string(data)
print_data(format_string, data)
Creación de la tabla drivers
[3]:
with open("/tmp/drivers.csv", "r") as file:
drivers = file.readlines()
drivers = [row.replace("\n", "") for row in drivers]
drivers = [row.split(",") for row in drivers]
drivers = [row[:2] for row in drivers]
pprint(drivers[0:10])
driverId name
0 10 George Vetticaden
1 11 Jamie Engesser
2 12 Paul Coddin
3 13 Joe Niemiec
4 14 Adis Cesir
5 15 Rohit Bakshi
6 16 Tom McCuch
7 17 Eric Mizell
8 18 Grant Liu
Creación de la tabla timesheet
[4]:
with open("/tmp/timesheet.csv", "r") as file:
timesheet = file.readlines()
timesheet = [row.replace("\n", "") for row in timesheet]
timesheet = [row.split(",") for row in timesheet]
pprint(timesheet[:10])
driverId week hours-logged miles-logged
0 10 1 70 3300
1 10 2 70 3300
2 10 3 60 2800
3 10 4 70 3100
4 10 5 70 3200
5 10 6 70 3300
6 10 7 70 3000
7 10 8 70 3300
8 10 9 70 3200
[5]:
#
# Typecast de datos
#
timesheet = [
[int(field) if i_row > 0 else field for field in row]
for i_row, row in enumerate(timesheet)
]
pprint(timesheet[:10])
driverId week hours-logged miles-logged
0 10 1 70 3300
1 10 2 70 3300
2 10 3 60 2800
3 10 4 70 3100
4 10 5 70 3200
5 10 6 70 3300
6 10 7 70 3000
7 10 8 70 3300
8 10 9 70 3200
Cantidad de horas y millas de cada conductor por año
[6]:
import itertools
from operator import itemgetter
for i_key, (key, group) in enumerate(
itertools.groupby(
timesheet[1:],
itemgetter(0),
)
):
print(key)
for i_grp, grp in enumerate(group):
print(" ", grp)
if i_grp > 3:
print(" ...")
break
if i_key > 2:
print("...")
break
10
[10, 1, 70, 3300]
[10, 2, 70, 3300]
[10, 3, 60, 2800]
[10, 4, 70, 3100]
[10, 5, 70, 3200]
...
11
[11, 1, 50, 3000]
[11, 2, 83, 4000]
[11, 3, 80, 4000]
[11, 4, 85, 4000]
[11, 5, 90, 4100]
...
12
[12, 1, 49, 2783]
[12, 2, 50, 2505]
[12, 3, 51, 2577]
[12, 4, 54, 2743]
[12, 5, 47, 2791]
...
13
[13, 1, 49, 2643]
[13, 2, 56, 2553]
[13, 3, 60, 2539]
[13, 4, 55, 2553]
[13, 5, 45, 2762]
...
...
[7]:
timesheet_grouped_by_driverId = {
driverId: list(group)
for driverId, group in itertools.groupby(
timesheet[1:],
key=itemgetter(0),
)
}
sum_timesheet = [
[
driverId,
sum([row[2] for row in timesheet_grouped_by_driverId[driverId]]),
sum([row[3] for row in timesheet_grouped_by_driverId[driverId]]),
]
for driverId in timesheet_grouped_by_driverId.keys()
]
sum_timesheet = [["driverId", "hours-logged", "miles-logged"]] + sum_timesheet
pprint(sum_timesheet)
driverId hours-logged miles-logged
0 10 3232 147150
1 11 3642 179300
2 12 2639 135962
3 13 2727 134126
4 14 2781 136624
5 15 2734 138750
6 16 2746 137205
7 17 2701 135992
8 18 2654 137834
9 19 2738 137968
10 20 2644 134564
11 21 2751 138719
12 22 2733 137550
13 23 2750 137980
14 24 2647 134461
15 25 2723 139180
16 26 2730 137530
17 27 2771 137922
18 28 2723 137469
19 29 2760 138255
20 30 2773 137473
21 31 2704 137057
22 32 2736 137422
23 33 2759 139285
24 34 2811 137728
25 35 2728 138727
26 36 2795 138025
27 37 2694 137223
28 38 2760 137464
29 39 2745 138788
30 40 2700 136931
31 41 2723 138407
32 42 2697 136673
33 43 2750 136993
Unión de las tablas
[8]:
summary = [
row_drivers + row_timesheet[1:]
for row_drivers in drivers[1:]
for row_timesheet in sum_timesheet[1:]
if row_drivers[0] == str(row_timesheet[0])
]
summary = [["driverId", "name", "hours-logged", "miles-logged"]] + summary
pprint(summary)
driverId name hours-logged miles-logged
0 10 George Vetticaden 3232 147150
1 11 Jamie Engesser 3642 179300
2 12 Paul Coddin 2639 135962
3 13 Joe Niemiec 2727 134126
4 14 Adis Cesir 2781 136624
5 15 Rohit Bakshi 2734 138750
6 16 Tom McCuch 2746 137205
7 17 Eric Mizell 2701 135992
8 18 Grant Liu 2654 137834
9 19 Ajay Singh 2738 137968
10 20 Chris Harris 2644 134564
11 21 Jeff Markham 2751 138719
12 22 Nadeem Asghar 2733 137550
13 23 Adam Diaz 2750 137980
14 24 Don Hilborn 2647 134461
15 25 Jean-Philippe Playe 2723 139180
16 26 Michael Aube 2730 137530
17 27 Mark Lochbihler 2771 137922
18 28 Olivier Renault 2723 137469
19 29 Teddy Choi 2760 138255
20 30 Dan Rice 2773 137473
21 31 Rommel Garcia 2704 137057
22 32 Ryan Templeton 2736 137422
23 33 Sridhara Sabbella 2759 139285
24 34 Frank Romano 2811 137728
25 35 Emil Siemes 2728 138727
26 36 Andrew Grande 2795 138025
27 37 Wes Floyd 2694 137223
28 38 Scott Shaw 2760 137464
29 39 David Kaiser 2745 138788
30 40 Nicolas Maillard 2700 136931
31 41 Greg Phillips 2723 138407
32 42 Randy Gelhausen 2697 136673
33 43 Dave Patton 2750 136993
Ordenamiento de la tabla
[9]:
from operator import itemgetter
sorted_summary = [summary[0]] + sorted([row for row in summary[1:]], key=itemgetter(2))
pprint(sorted_summary)
driverId name hours-logged miles-logged
0 12 Paul Coddin 2639 135962
1 20 Chris Harris 2644 134564
2 24 Don Hilborn 2647 134461
3 18 Grant Liu 2654 137834
4 37 Wes Floyd 2694 137223
5 42 Randy Gelhausen 2697 136673
6 40 Nicolas Maillard 2700 136931
7 17 Eric Mizell 2701 135992
8 31 Rommel Garcia 2704 137057
9 25 Jean-Philippe Playe 2723 139180
10 28 Olivier Renault 2723 137469
11 41 Greg Phillips 2723 138407
12 13 Joe Niemiec 2727 134126
13 35 Emil Siemes 2728 138727
14 26 Michael Aube 2730 137530
15 22 Nadeem Asghar 2733 137550
16 15 Rohit Bakshi 2734 138750
17 32 Ryan Templeton 2736 137422
18 19 Ajay Singh 2738 137968
19 39 David Kaiser 2745 138788
20 16 Tom McCuch 2746 137205
21 23 Adam Diaz 2750 137980
22 43 Dave Patton 2750 136993
23 21 Jeff Markham 2751 138719
24 33 Sridhara Sabbella 2759 139285
25 29 Teddy Choi 2760 138255
26 38 Scott Shaw 2760 137464
27 27 Mark Lochbihler 2771 137922
28 30 Dan Rice 2773 137473
29 14 Adis Cesir 2781 136624
30 36 Andrew Grande 2795 138025
31 34 Frank Romano 2811 137728
32 10 George Vetticaden 3232 147150
33 11 Jamie Engesser 3642 179300
Búsqueda del máximo o el mínimo
[10]:
[
row
for row in sorted_summary[1:]
if row[2] == max(aux_row[2] for aux_row in sorted_summary[1:])
]
[10]:
[['11', 'Jamie Engesser', 3642, 179300]]
Almacenamiento de los resultados
[11]:
summary = [[str(field) for field in row] for row in summary]
summary = [",".join(row) for row in summary]
summary = "\n".join(summary)
with open("/tmp/summary.csv", "w") as file:
print(summary, file=file)
!cat /tmp/summary.csv
driverId,name,hours-logged,miles-logged
10,George Vetticaden,3232,147150
11,Jamie Engesser,3642,179300
12,Paul Coddin,2639,135962
13,Joe Niemiec,2727,134126
14,Adis Cesir,2781,136624
15,Rohit Bakshi,2734,138750
16,Tom McCuch,2746,137205
17,Eric Mizell,2701,135992
18,Grant Liu,2654,137834
19,Ajay Singh,2738,137968
20,Chris Harris,2644,134564
21,Jeff Markham,2751,138719
22,Nadeem Asghar,2733,137550
23,Adam Diaz,2750,137980
24,Don Hilborn,2647,134461
25,Jean-Philippe Playe,2723,139180
26,Michael Aube,2730,137530
27,Mark Lochbihler,2771,137922
28,Olivier Renault,2723,137469
29,Teddy Choi,2760,138255
30,Dan Rice,2773,137473
31,Rommel Garcia,2704,137057
32,Ryan Templeton,2736,137422
33,Sridhara Sabbella,2759,139285
34,Frank Romano,2811,137728
35,Emil Siemes,2728,138727
36,Andrew Grande,2795,138025
37,Wes Floyd,2694,137223
38,Scott Shaw,2760,137464
39,David Kaiser,2745,138788
40,Nicolas Maillard,2700,136931
41,Greg Phillips,2723,138407
42,Randy Gelhausen,2697,136673
43,Dave Patton,2750,136993
[12]:
!rm /tmp/*