Caso práctico — Exportación de un subconjunto de datos — 6:29 min

  • 6:29 min | Última modificación: Octubre 6, 2021

Este tutorial esta basado en https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/

Preparación

[1]:
import pandas as pd

pd.set_option("display.notebook_repr_html", False)

Carga de los datos de los eventos de los conductores

[2]:
#
# Carga del archivo ddesde un repo en GitHub
#
truck_events = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
)

#
# Cabecera del archivo
#
truck_events.head()
[2]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22

Obtención de un subconjunto de registros

En hive es posible un subconjunto de datos y almacenarlo en una nueva tabla a partir de una consulta que permita obtener los datos deseados. En el siguiente código, se crea la tabla truck_events_subset con los primeros 100 registros de la tabla truck_events.

[3]:
truck_events_subset = truck_events[0:10]
truck_events_subset
[3]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19
5        22       87   59:22.3    Normal     -90.37     35.21
6        23       68   59:22.4    Normal     -89.91     40.86
7        11       74   59:22.5    Normal     -89.74     39.10
8        20       41   59:22.5    Normal     -93.36     41.69
9        32       42   59:22.5    Normal     -90.37     35.21

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881
5   22|87|9223370572464813486   3.660000e+18    Nadeem Asghar  1198242881
6   23|68|9223370572464813450   3.660000e+18        Adam Diaz   160405074
7   11|74|9223370572464813355   3.660000e+18   Jamie Engesser  1567254452
8   20|41|9223370572464813344   3.660000e+18     Chris Harris   160779139
9   32|42|9223370572464813296   3.660000e+18   Ryan Templeton  1090292248

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
5           Saint Louis to Chicago Route2  2016-05-27-22
6           Joplin to Kansas City Route 2  2016-05-27-22
7           Saint Louis to Memphis Route2  2016-05-27-22
8           Des Moines to Chicago Route 2  2016-05-27-22
9          Peoria to Ceder Rapids Route 2  2016-05-27-22

Obtención de un subconjunto de columnas

En el siguiente código se obtienen algunas columnas de la tabla truck_events_subset para ser almacenadas en una tabla diferente.

[4]:
specific_columns = truck_events_subset[["driverId", "eventTime", "eventType"]]
specific_columns
[4]:
   driverId eventTime eventType
0        14   59:21.4    Normal
1        18   59:21.7    Normal
2        27   59:21.7    Normal
3        11   59:21.7    Normal
4        22   59:21.7    Normal
5        22   59:22.3    Normal
6        23   59:22.4    Normal
7        11   59:22.5    Normal
8        20   59:22.5    Normal
9        32   59:22.5    Normal

## Obtención de un subconjunto de filas y columnas

[5]:
new_sub_set = truck_events.loc[0:10, ["driverId", "eventTime", "eventType"]]
new_sub_set
[5]:
    driverId eventTime eventType
0         14   59:21.4    Normal
1         18   59:21.7    Normal
2         27   59:21.7    Normal
3         11   59:21.7    Normal
4         22   59:21.7    Normal
5         22   59:22.3    Normal
6         23   59:22.4    Normal
7         11   59:22.5    Normal
8         20   59:22.5    Normal
9         32   59:22.5    Normal
10        27   59:22.6    Normal

Obtención de un campo de un registro en particular

[6]:
truck_events.iloc[1]
[6]:
driverId                                    18
truckId                                     16
eventTime                              59:21.7
eventType                               Normal
longitude                               -89.66
latitude                                 39.78
eventKey             18|16|9223370572464814089
CorrelationId                         3.66e+18
driverName                           Grant Liu
routeId                             1565885487
routeName        Springfield to KC Via Hanibal
eventDate                        2016-05-27-22
Name: 1, dtype: object
[7]:
truck_events.iloc[1].eventKey
[7]:
'18|16|9223370572464814089'
[8]:
truck_events.iloc[1]["eventKey"]
[8]:
'18|16|9223370572464814089'

Escritura de la tabla en el disco

[9]:
specific_columns.to_csv("/tmp/specific-columns.csv", sep=",", header=True, index=False)
[10]:
#
# Se visualiza el contenido del directorio
#
!ls /tmp/
specific-columns.csv
[11]:
#
# Se visualiza la parte final del archivo
#
!cat /tmp/specific-columns.csv
driverId,eventTime,eventType
14,59:21.4,Normal
18,59:21.7,Normal
27,59:21.7,Normal
11,59:21.7,Normal
22,59:21.7,Normal
22,59:22.3,Normal
23,59:22.4,Normal
11,59:22.5,Normal
20,59:22.5,Normal
32,59:22.5,Normal