Caso práctico — Exportación de un subconjunto de datos — 6:29 min
6:29 min | Última modificación: Octubre 6, 2021
Este tutorial esta basado en https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/
Preparación
[1]:
import pandas as pd
pd.set_option("display.notebook_repr_html", False)
Carga de los datos de los eventos de los conductores
[2]:
#
# Carga del archivo ddesde un repo en GitHub
#
truck_events = pd.read_csv(
"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
)
#
# Cabecera del archivo
#
truck_events.head()
[2]:
driverId truckId eventTime eventType longitude latitude \
0 14 25 59:21.4 Normal -94.58 37.03
1 18 16 59:21.7 Normal -89.66 39.78
2 27 105 59:21.7 Normal -90.21 38.65
3 11 74 59:21.7 Normal -90.20 38.65
4 22 87 59:21.7 Normal -90.04 35.19
eventKey CorrelationId driverName routeId \
0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074
1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487
2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373
3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452
4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881
routeName eventDate
0 Joplin to Kansas City Route 2 2016-05-27-22
1 Springfield to KC Via Hanibal 2016-05-27-22
2 Springfield to KC Via Columbia Route 2 2016-05-27-22
3 Saint Louis to Memphis Route2 2016-05-27-22
4 Saint Louis to Chicago Route2 2016-05-27-22
Obtención de un subconjunto de registros
En hive es posible un subconjunto de datos y almacenarlo en una nueva tabla a partir de una consulta que permita obtener los datos deseados. En el siguiente código, se crea la tabla truck_events_subset
con los primeros 100 registros de la tabla truck_events
.
[3]:
truck_events_subset = truck_events[0:10]
truck_events_subset
[3]:
driverId truckId eventTime eventType longitude latitude \
0 14 25 59:21.4 Normal -94.58 37.03
1 18 16 59:21.7 Normal -89.66 39.78
2 27 105 59:21.7 Normal -90.21 38.65
3 11 74 59:21.7 Normal -90.20 38.65
4 22 87 59:21.7 Normal -90.04 35.19
5 22 87 59:22.3 Normal -90.37 35.21
6 23 68 59:22.4 Normal -89.91 40.86
7 11 74 59:22.5 Normal -89.74 39.10
8 20 41 59:22.5 Normal -93.36 41.69
9 32 42 59:22.5 Normal -90.37 35.21
eventKey CorrelationId driverName routeId \
0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074
1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487
2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373
3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452
4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881
5 22|87|9223370572464813486 3.660000e+18 Nadeem Asghar 1198242881
6 23|68|9223370572464813450 3.660000e+18 Adam Diaz 160405074
7 11|74|9223370572464813355 3.660000e+18 Jamie Engesser 1567254452
8 20|41|9223370572464813344 3.660000e+18 Chris Harris 160779139
9 32|42|9223370572464813296 3.660000e+18 Ryan Templeton 1090292248
routeName eventDate
0 Joplin to Kansas City Route 2 2016-05-27-22
1 Springfield to KC Via Hanibal 2016-05-27-22
2 Springfield to KC Via Columbia Route 2 2016-05-27-22
3 Saint Louis to Memphis Route2 2016-05-27-22
4 Saint Louis to Chicago Route2 2016-05-27-22
5 Saint Louis to Chicago Route2 2016-05-27-22
6 Joplin to Kansas City Route 2 2016-05-27-22
7 Saint Louis to Memphis Route2 2016-05-27-22
8 Des Moines to Chicago Route 2 2016-05-27-22
9 Peoria to Ceder Rapids Route 2 2016-05-27-22
Obtención de un subconjunto de columnas
En el siguiente código se obtienen algunas columnas de la tabla truck_events_subset
para ser almacenadas en una tabla diferente.
[4]:
specific_columns = truck_events_subset[["driverId", "eventTime", "eventType"]]
specific_columns
[4]:
driverId eventTime eventType
0 14 59:21.4 Normal
1 18 59:21.7 Normal
2 27 59:21.7 Normal
3 11 59:21.7 Normal
4 22 59:21.7 Normal
5 22 59:22.3 Normal
6 23 59:22.4 Normal
7 11 59:22.5 Normal
8 20 59:22.5 Normal
9 32 59:22.5 Normal
## Obtención de un subconjunto de filas y columnas
[5]:
new_sub_set = truck_events.loc[0:10, ["driverId", "eventTime", "eventType"]]
new_sub_set
[5]:
driverId eventTime eventType
0 14 59:21.4 Normal
1 18 59:21.7 Normal
2 27 59:21.7 Normal
3 11 59:21.7 Normal
4 22 59:21.7 Normal
5 22 59:22.3 Normal
6 23 59:22.4 Normal
7 11 59:22.5 Normal
8 20 59:22.5 Normal
9 32 59:22.5 Normal
10 27 59:22.6 Normal
Obtención de un campo de un registro en particular
[6]:
truck_events.iloc[1]
[6]:
driverId 18
truckId 16
eventTime 59:21.7
eventType Normal
longitude -89.66
latitude 39.78
eventKey 18|16|9223370572464814089
CorrelationId 3.66e+18
driverName Grant Liu
routeId 1565885487
routeName Springfield to KC Via Hanibal
eventDate 2016-05-27-22
Name: 1, dtype: object
[7]:
truck_events.iloc[1].eventKey
[7]:
'18|16|9223370572464814089'
[8]:
truck_events.iloc[1]["eventKey"]
[8]:
'18|16|9223370572464814089'
Escritura de la tabla en el disco
[9]:
specific_columns.to_csv("/tmp/specific-columns.csv", sep=",", header=True, index=False)
[10]:
#
# Se visualiza el contenido del directorio
#
!ls /tmp/
specific-columns.csv
[11]:
#
# Se visualiza la parte final del archivo
#
!cat /tmp/specific-columns.csv
driverId,eventTime,eventType
14,59:21.4,Normal
18,59:21.7,Normal
27,59:21.7,Normal
11,59:21.7,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
11,59:22.5,Normal
20,59:22.5,Normal
32,59:22.5,Normal