Selección y manejo de columnas — 5:34 min

  • 5:34 min | Última modificación: Octubre 6, 2021

Este tutorial esta basado en https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/

Preparación

[1]:
import pandas as pd

pd.set_option("display.notebook_repr_html", False)

Carga de los datos de los eventos de los conductores

En el siguiente código se crea crea un DataFrame de eventos de los conductores en el sistema.

[2]:
#
# Carga del archivo ddesde un repo en GitHub
#
truck_events = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
    sep=",",
    thousands=None,
    decimal=".",
)

#
# Columnas del dataframe
#
truck_events.columns
[2]:
Index(['driverId', 'truckId', 'eventTime', 'eventType', 'longitude',
       'latitude', 'eventKey', 'CorrelationId', 'driverName', 'routeId',
       'routeName', 'eventDate'],
      dtype='object')

Obtención de un subconjunto de columnas usando [ y ]

[3]:
truck_events[
    ["driverId", "eventTime", "eventType"],
].head()
[3]:
   driverId eventTime eventType
0        14   59:21.4    Normal
1        18   59:21.7    Normal
2        27   59:21.7    Normal
3        11   59:21.7    Normal
4        22   59:21.7    Normal

Obtención de un subcojunto de columnas usando filter()

[4]:
truck_events.filter(items=["driverId", "eventTime", "eventType"]).head()
[4]:
   driverId eventTime eventType
0        14   59:21.4    Normal
1        18   59:21.7    Normal
2        27   59:21.7    Normal
3        11   59:21.7    Normal
4        22   59:21.7    Normal

Obtención de un subconjunto de columnas usando una expresión regular

[5]:
truck_events.filter(regex="e$", axis=1).head()
[5]:
  eventTime eventType  longitude  latitude       driverName  \
0   59:21.4    Normal     -94.58     37.03       Adis Cesir
1   59:21.7    Normal     -89.66     39.78        Grant Liu
2   59:21.7    Normal     -90.21     38.65  Mark Lochbihler
3   59:21.7    Normal     -90.20     38.65   Jamie Engesser
4   59:21.7    Normal     -90.04     35.19    Nadeem Asghar

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
[6]:
#
# Selección de las columnas que contienen 'ent'
#
truck_events.filter(like="ent", axis=1).head()
[6]:
  eventTime eventType                    eventKey      eventDate
0   59:21.4    Normal   14|25|9223370572464814373  2016-05-27-22
1   59:21.7    Normal   18|16|9223370572464814089  2016-05-27-22
2   59:21.7    Normal  27|105|9223370572464814070  2016-05-27-22
3   59:21.7    Normal   11|74|9223370572464814123  2016-05-27-22
4   59:21.7    Normal   22|87|9223370572464814101  2016-05-27-22

Ordenamiento de las columnas en el dataframe

[7]:
truck_events.sort_index(
    axis=1,
    ascending=True,
).head()
[7]:
       CorrelationId  driverId         driverName      eventDate  \
0       3.660000e+18        14         Adis Cesir  2016-05-27-22
1       3.660000e+18        18          Grant Liu  2016-05-27-22
2       3.660000e+18        27    Mark Lochbihler  2016-05-27-22
3       3.660000e+18        11     Jamie Engesser  2016-05-27-22
4       3.660000e+18        22      Nadeem Asghar  2016-05-27-22
...              ...       ...                ...            ...
17070   1.000000e+03        11     Jamie Engesser  2016-06-02-20
17071   1.000000e+03        16         Tom McCuch  2016-06-02-20
17072   1.000000e+03        18          Grant Liu  2016-06-02-20
17073   1.000000e+03        10  George Vetticaden  2016-06-02-20
17074   1.000000e+03        19         Ajay Singh  2016-06-02-20

                         eventKey eventTime eventType  latitude  longitude  \
0       14|25|9223370572464814373   59:21.4    Normal     37.03     -94.58
1       18|16|9223370572464814089   59:21.7    Normal     39.78     -89.66
2      27|105|9223370572464814070   59:21.7    Normal     38.65     -90.21
3       11|74|9223370572464814123   59:21.7    Normal     38.65     -90.20
4       22|87|9223370572464814101   59:21.7    Normal     35.19     -90.04
...                           ...       ...       ...       ...        ...
17070   11|27|9223370571956432681   12:23.1    Normal     38.65     -90.20
17071   16|46|9223370571956431821   12:24.0    Normal     38.33     -94.35
17072   18|49|9223370571956432141   12:23.7    Normal     39.71     -90.52
17073   10|39|9223370571956431961   12:23.8    Normal     37.21     -93.34
17074  19|100|9223370571956431810   12:24.0    Normal     36.79     -97.37

          routeId                               routeName  truckId
0       160405074           Joplin to Kansas City Route 2       25
1      1565885487           Springfield to KC Via Hanibal       16
2      1325562373  Springfield to KC Via Columbia Route 2      105
3      1567254452           Saint Louis to Memphis Route2       74
4      1198242881           Saint Louis to Chicago Route2       87
...           ...                                     ...      ...
17070  1198242881           Saint Louis to Chicago Route2       27
17071   160405074           Joplin to Kansas City Route 2       46
17072  1565885487           Springfield to KC Via Hanibal       49
17073  1390372503                    Saint Louis to Tulsa       39
17074  1962261785              Wichita to Little Rock.kml      100

[17075 rows x 12 columns]

Cambio del nombre de columnas

[8]:
truck_events.rename(
    columns={
        "eventTime": "event_time",
        "eventType": "event_type",
    },
).head()
[8]:
   driverId  truckId event_time event_type  longitude  latitude  \
0        14       25    59:21.4     Normal     -94.58     37.03
1        18       16    59:21.7     Normal     -89.66     39.78
2        27      105    59:21.7     Normal     -90.21     38.65
3        11       74    59:21.7     Normal     -90.20     38.65
4        22       87    59:21.7     Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22