Selección y manejo de columnas — 5:34 min
5:34 min | Última modificación: Octubre 6, 2021
Este tutorial esta basado en https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/
Preparación
[1]:
import pandas as pd
pd.set_option("display.notebook_repr_html", False)
Carga de los datos de los eventos de los conductores
En el siguiente código se crea crea un DataFrame de eventos de los conductores en el sistema.
[2]:
#
# Carga del archivo ddesde un repo en GitHub
#
truck_events = pd.read_csv(
"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
sep=",",
thousands=None,
decimal=".",
)
#
# Columnas del dataframe
#
truck_events.columns
[2]:
Index(['driverId', 'truckId', 'eventTime', 'eventType', 'longitude',
'latitude', 'eventKey', 'CorrelationId', 'driverName', 'routeId',
'routeName', 'eventDate'],
dtype='object')
Obtención de un subconjunto de columnas usando [
y ]
[3]:
truck_events[
["driverId", "eventTime", "eventType"],
].head()
[3]:
driverId eventTime eventType
0 14 59:21.4 Normal
1 18 59:21.7 Normal
2 27 59:21.7 Normal
3 11 59:21.7 Normal
4 22 59:21.7 Normal
Obtención de un subcojunto de columnas usando filter()
[4]:
truck_events.filter(items=["driverId", "eventTime", "eventType"]).head()
[4]:
driverId eventTime eventType
0 14 59:21.4 Normal
1 18 59:21.7 Normal
2 27 59:21.7 Normal
3 11 59:21.7 Normal
4 22 59:21.7 Normal
Obtención de un subconjunto de columnas usando una expresión regular
[5]:
truck_events.filter(regex="e$", axis=1).head()
[5]:
eventTime eventType longitude latitude driverName \
0 59:21.4 Normal -94.58 37.03 Adis Cesir
1 59:21.7 Normal -89.66 39.78 Grant Liu
2 59:21.7 Normal -90.21 38.65 Mark Lochbihler
3 59:21.7 Normal -90.20 38.65 Jamie Engesser
4 59:21.7 Normal -90.04 35.19 Nadeem Asghar
routeName eventDate
0 Joplin to Kansas City Route 2 2016-05-27-22
1 Springfield to KC Via Hanibal 2016-05-27-22
2 Springfield to KC Via Columbia Route 2 2016-05-27-22
3 Saint Louis to Memphis Route2 2016-05-27-22
4 Saint Louis to Chicago Route2 2016-05-27-22
[6]:
#
# Selección de las columnas que contienen 'ent'
#
truck_events.filter(like="ent", axis=1).head()
[6]:
eventTime eventType eventKey eventDate
0 59:21.4 Normal 14|25|9223370572464814373 2016-05-27-22
1 59:21.7 Normal 18|16|9223370572464814089 2016-05-27-22
2 59:21.7 Normal 27|105|9223370572464814070 2016-05-27-22
3 59:21.7 Normal 11|74|9223370572464814123 2016-05-27-22
4 59:21.7 Normal 22|87|9223370572464814101 2016-05-27-22
Ordenamiento de las columnas en el dataframe
[7]:
truck_events.sort_index(
axis=1,
ascending=True,
).head()
[7]:
CorrelationId driverId driverName eventDate \
0 3.660000e+18 14 Adis Cesir 2016-05-27-22
1 3.660000e+18 18 Grant Liu 2016-05-27-22
2 3.660000e+18 27 Mark Lochbihler 2016-05-27-22
3 3.660000e+18 11 Jamie Engesser 2016-05-27-22
4 3.660000e+18 22 Nadeem Asghar 2016-05-27-22
... ... ... ... ...
17070 1.000000e+03 11 Jamie Engesser 2016-06-02-20
17071 1.000000e+03 16 Tom McCuch 2016-06-02-20
17072 1.000000e+03 18 Grant Liu 2016-06-02-20
17073 1.000000e+03 10 George Vetticaden 2016-06-02-20
17074 1.000000e+03 19 Ajay Singh 2016-06-02-20
eventKey eventTime eventType latitude longitude \
0 14|25|9223370572464814373 59:21.4 Normal 37.03 -94.58
1 18|16|9223370572464814089 59:21.7 Normal 39.78 -89.66
2 27|105|9223370572464814070 59:21.7 Normal 38.65 -90.21
3 11|74|9223370572464814123 59:21.7 Normal 38.65 -90.20
4 22|87|9223370572464814101 59:21.7 Normal 35.19 -90.04
... ... ... ... ... ...
17070 11|27|9223370571956432681 12:23.1 Normal 38.65 -90.20
17071 16|46|9223370571956431821 12:24.0 Normal 38.33 -94.35
17072 18|49|9223370571956432141 12:23.7 Normal 39.71 -90.52
17073 10|39|9223370571956431961 12:23.8 Normal 37.21 -93.34
17074 19|100|9223370571956431810 12:24.0 Normal 36.79 -97.37
routeId routeName truckId
0 160405074 Joplin to Kansas City Route 2 25
1 1565885487 Springfield to KC Via Hanibal 16
2 1325562373 Springfield to KC Via Columbia Route 2 105
3 1567254452 Saint Louis to Memphis Route2 74
4 1198242881 Saint Louis to Chicago Route2 87
... ... ... ...
17070 1198242881 Saint Louis to Chicago Route2 27
17071 160405074 Joplin to Kansas City Route 2 46
17072 1565885487 Springfield to KC Via Hanibal 49
17073 1390372503 Saint Louis to Tulsa 39
17074 1962261785 Wichita to Little Rock.kml 100
[17075 rows x 12 columns]
Cambio del nombre de columnas
[8]:
truck_events.rename(
columns={
"eventTime": "event_time",
"eventType": "event_type",
},
).head()
[8]:
driverId truckId event_time event_type longitude latitude \
0 14 25 59:21.4 Normal -94.58 37.03
1 18 16 59:21.7 Normal -89.66 39.78
2 27 105 59:21.7 Normal -90.21 38.65
3 11 74 59:21.7 Normal -90.20 38.65
4 22 87 59:21.7 Normal -90.04 35.19
eventKey CorrelationId driverName routeId \
0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074
1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487
2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373
3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452
4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881
routeName eventDate
0 Joplin to Kansas City Route 2 2016-05-27-22
1 Springfield to KC Via Hanibal 2016-05-27-22
2 Springfield to KC Via Columbia Route 2 2016-05-27-22
3 Saint Louis to Memphis Route2 2016-05-27-22
4 Saint Louis to Chicago Route2 2016-05-27-22