Viewing the structure of a dataframe — 10:54 min

  • 10:54 min | Last modified: October 6, 2021

This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/

Pandas is a high-performance Python library for data manipulation and analysis. It is widely used in analytics and data science, so mastering it is essential. Pandas specializes in “tidy” structures, that is, data tables where each row is a record and each column is an attribute.
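
As a minimal sketch of what a tidy table looks like, the following block builds a small dataframe from a list of records. The data and the name `toy_events` are hypothetical and only mimic a few columns of the file used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data: each dictionary is one record (row) and
# each key is one attribute (column).
records = [
    {"driverId": 14, "truckId": 25, "eventType": "Normal"},
    {"driverId": 18, "truckId": 16, "eventType": "Normal"},
    {"driverId": 27, "truckId": 105, "eventType": "Overspeed"},
]
toy_events = pd.DataFrame(records)

# Three records with three attributes each.
n_rows, n_columns = toy_events.shape
print(n_rows, n_columns)
```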

[1]:
#
# Setup
#
import pandas as pd
import numpy as np

pd.set_option("display.notebook_repr_html", False)
[2]:
#
# Load the file from a GitHub repo
#
truck_events = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
    sep=",",
    thousands=None,
    decimal=".",
)

#
# Total number of records read.
#
len(truck_events)
[2]:
17075
[3]:
#
# Size (rows, columns)
#
truck_events.shape
[3]:
(17075, 12)
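
The two cells above report the same row count in different ways. A short sketch on a hypothetical toy dataframe (not the tutorial file) shows how `len()`, `shape`, and `size` relate to each other:

```python
import pandas as pd

# Hypothetical toy dataframe with 4 rows and 2 columns.
toy = pd.DataFrame(
    {
        "driverId": [14, 18, 27, 11],
        "truckId": [25, 16, 105, 74],
    }
)

n_rows = len(toy)    # number of rows only
shape = toy.shape    # tuple (rows, columns)
n_cells = toy.size   # rows * columns

print(n_rows, shape, n_cells)
```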
[4]:
#
# Display the data type of each column.
#
truck_events.dtypes
[4]:
driverId           int64
truckId            int64
eventTime         object
eventType         object
longitude        float64
latitude         float64
eventKey          object
CorrelationId    float64
driverName        object
routeId            int64
routeName         object
eventDate         object
dtype: object
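
Once the column types are known, it is often useful to keep only the columns of a given kind. The following sketch uses `select_dtypes()` on a hypothetical toy dataframe to list the numeric columns; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe mixing integer, float, and text columns.
toy = pd.DataFrame(
    {
        "driverId": [14, 18],
        "longitude": [-94.58, -89.66],
        "driverName": ["Adis Cesir", "Grant Liu"],
    }
)

# Keep only the numeric columns (integers and floats).
numeric_columns = list(toy.select_dtypes(include="number").columns)
print(numeric_columns)
```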
[5]:
#
# Display the column names.
#
truck_events.columns
[5]:
Index(['driverId', 'truckId', 'eventTime', 'eventType', 'longitude',
       'latitude', 'eventKey', 'CorrelationId', 'driverName', 'routeId',
       'routeName', 'eventDate'],
      dtype='object')
[6]:
#
# Columns sorted alphabetically. Uppercase letters sort
# before lowercase ones, so 'CorrelationId' comes first.
#
sorted(truck_events.columns)
[6]:
['CorrelationId',
 'driverId',
 'driverName',
 'eventDate',
 'eventKey',
 'eventTime',
 'eventType',
 'latitude',
 'longitude',
 'routeId',
 'routeName',
 'truckId']
[7]:
#
# Display the row index.
#
truck_events.index
[7]:
RangeIndex(start=0, stop=17075, step=1)
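
When no index is specified, pandas assigns a `RangeIndex` that numbers the rows from zero. A minimal sketch with a hypothetical three-row dataframe:

```python
import pandas as pd

# Hypothetical toy dataframe created without an explicit index.
toy = pd.DataFrame({"driverId": [14, 18, 27]})

index = toy.index
# The default index counts rows from 0 up to (but not including) len(toy).
bounds = (index.start, index.stop, index.step)
print(type(index).__name__, bounds)
```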
[8]:
#
# Extract the dataframe values as a NumPy array.
#
truck_events.values
[8]:
array([[14, 25, '59:21.4', ..., 160405074,
        'Joplin to Kansas City Route 2', '2016-05-27-22'],
       [18, 16, '59:21.7', ..., 1565885487,
        'Springfield to KC Via Hanibal', '2016-05-27-22'],
       [27, 105, '59:21.7', ..., 1325562373,
        'Springfield to KC Via Columbia Route 2', '2016-05-27-22'],
       ...,
       [18, 49, '12:23.7', ..., 1565885487,
        'Springfield to KC Via Hanibal', '2016-06-02-20'],
       [10, 39, '12:23.8', ..., 1390372503, 'Saint Louis to Tulsa',
        '2016-06-02-20'],
       [19, 100, '12:24.0', ..., 1962261785,
        'Wichita to Little Rock.kml', '2016-06-02-20']], dtype=object)
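
The pandas documentation recommends `DataFrame.to_numpy()` over the `.values` attribute for this conversion. The sketch below applies it to a hypothetical toy dataframe with only numeric columns, where the resulting array takes a common numeric type instead of `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with one integer and one float column.
toy = pd.DataFrame(
    {
        "driverId": [14, 18],
        "longitude": [-94.58, -89.66],
    }
)

# Convert to a NumPy array; integers are upcast to float64 so that
# all elements share a single dtype.
matrix = toy.to_numpy()
print(matrix.shape, matrix.dtype)
```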
[9]:
#
# Display the first five records of the dataframe.
#
truck_events.head()
[9]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
[10]:
#
# Display the first three records of the dataframe.
#
truck_events.head(n=3)
[10]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
[11]:
#
# A negative n excludes that many records from the end
# of the dataframe. This dataframe has 17,075 rows, so
# n=-17070 returns the first five.
#
truck_events.head(n=-17070)
[11]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
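
To make the effect of a negative `n` easier to verify, this sketch repeats the idea on a hypothetical ten-row dataframe and compares the result with the equivalent positional slice.

```python
import pandas as pd

# Hypothetical toy dataframe with rows labeled 0..9.
toy = pd.DataFrame({"row": range(10)})

# Everything except the last 7 rows, i.e. rows 0, 1 and 2.
first_rows = toy.head(n=-7)

# The same selection written as a positional slice.
same_rows = toy.iloc[:-7]

print(first_rows.equals(same_rows))
```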
[12]:
#
# Display the last five records of the dataframe.
#
truck_events.tail()
[12]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17070   11|27|9223370571956432681         1000.0     Jamie Engesser
17071   16|46|9223370571956431821         1000.0         Tom McCuch
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                       routeName      eventDate
17070  1198242881   Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074   Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487   Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503            Saint Louis to Tulsa  2016-06-02-20
17074  1962261785      Wichita to Little Rock.kml  2016-06-02-20
[13]:
#
# Display the last three records of the dataframe.
#
truck_events.tail(n=3)
[13]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                      routeName      eventDate
17072  1565885487  Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503           Saint Louis to Tulsa  2016-06-02-20
17074  1962261785     Wichita to Little Rock.kml  2016-06-02-20
[14]:
#
# A negative n excludes that many records from the
# beginning of the dataframe.
#
truck_events.tail(n=-17070)
[14]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17070   11|27|9223370571956432681         1000.0     Jamie Engesser
17071   16|46|9223370571956431821         1000.0         Tom McCuch
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                       routeName      eventDate
17070  1198242881   Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074   Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487   Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503            Saint Louis to Tulsa  2016-06-02-20
17074  1962261785      Wichita to Little Rock.kml  2016-06-02-20
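
The same check can be made for `tail()`. In this sketch, again on a hypothetical ten-row dataframe, a negative `n` drops rows from the start and matches a positional slice that begins at that offset.

```python
import pandas as pd

# Hypothetical toy dataframe with rows labeled 0..9.
toy = pd.DataFrame({"row": range(10)})

# Everything except the first 7 rows, i.e. rows 7, 8 and 9.
last_rows = toy.tail(n=-7)

# The same selection written as a positional slice.
same_rows = toy.iloc[7:]

print(last_rows.equals(same_rows))
```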
[15]:
#
# Display a middle portion of the dataframe by
# combining head() and tail().
#
truck_events.head(n=20).tail(n=10)
[15]:
    driverId  truckId eventTime eventType  longitude  latitude  \
10        27      105   59:22.6    Normal     -90.41     38.75
11        17       15   59:23.2    Normal     -90.55     38.81
12        14       25   59:23.3    Normal     -94.31     37.31
13        28       39   59:23.3    Normal     -89.96     39.74
14        15       51   59:23.4    Normal     -90.68     35.12
15        16       12   59:23.4    Normal     -90.29     40.96
16        31       18   59:23.5    Normal     -94.31     37.31
17        25       96   59:23.5    Normal     -90.24     38.00
18        14       25   59:24.2    Normal     -94.30     37.66
19        22       87   59:24.2    Normal     -90.94     35.03

                      eventKey  CorrelationId            driverName  \
10  27|105|9223370572464813205   3.660000e+18       Mark Lochbihler
11   17|15|9223370572464812585   3.660000e+18           Eric Mizell
12   14|25|9223370572464812526   3.660000e+18            Adis Cesir
13   28|39|9223370572464812496   3.660000e+18       Olivier Renault
14   15|51|9223370572464812405   3.660000e+18          Rohit Bakshi
15   16|12|9223370572464812395   3.660000e+18            Tom McCuch
16   31|18|9223370572464812346   3.660000e+18         Rommel Garcia
17   25|96|9223370572464812336   3.660000e+18  Jean-Philippe Player
18   14|25|9223370572464811655   3.660000e+18            Adis Cesir
19   22|87|9223370572464811652   3.660000e+18         Nadeem Asghar

       routeId                               routeName      eventDate
10  1325562373  Springfield to KC Via Columbia Route 2  2016-05-27-22
11  1927624662          Springfield to KC Via Columbia  2016-05-27-22
12   160405074           Joplin to Kansas City Route 2  2016-05-27-22
13   137128276   Springfield to KC Via Hanibal Route 2  2016-05-27-22
14  1384345811                   Joplin to Kansas City  2016-05-27-22
15  1961634315                  Saint Louis to Memphis  2016-05-27-22
16  1594289134          Memphis to Little Rock Route 2  2016-05-27-22
17   371182829                  Memphis to Little Rock  2016-05-27-22
18   160405074           Joplin to Kansas City Route 2  2016-05-27-22
19  1198242881           Saint Louis to Chicago Route2  2016-05-27-22
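
Chaining `head()` and `tail()` is one way to reach a middle block of rows. A positional slice with `iloc` is an alternative that states the range directly; the sketch below, on a hypothetical thirty-row dataframe, confirms that both give the same rows.

```python
import pandas as pd

# Hypothetical toy dataframe with rows labeled 0..29.
toy = pd.DataFrame({"row": range(30)})

# First 20 rows, then the last 10 of those: rows 10..19.
middle = toy.head(n=20).tail(n=10)

# Equivalent positional slice (start included, stop excluded).
sliced = toy.iloc[10:20]

print(middle.equals(sliced))
```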
[16]:
#
# Display the first five records of a column
# (attribute notation).
#
truck_events.routeName.head()
[16]:
0             Joplin to Kansas City Route 2
1             Springfield to KC Via Hanibal
2    Springfield to KC Via Columbia Route 2
3             Saint Louis to Memphis Route2
4             Saint Louis to Chicago Route2
Name: routeName, dtype: object
[17]:
#
# Display the first three records of a column
# (bracket notation).
#
truck_events['routeName'].head(n=3)
[17]:
0             Joplin to Kansas City Route 2
1             Springfield to KC Via Hanibal
2    Springfield to KC Via Columbia Route 2
Name: routeName, dtype: object
[18]:
#
# Display the last five records of a column
# (attribute notation).
#
truck_events.routeName.tail()
[18]:
17070     Saint Louis to Chicago Route2
17071     Joplin to Kansas City Route 2
17072     Springfield to KC Via Hanibal
17073              Saint Louis to Tulsa
17074        Wichita to Little Rock.kml
Name: routeName, dtype: object
[19]:
#
# Display the last five records of a column
# (bracket notation).
#
truck_events['routeName'].tail()
[19]:
17070     Saint Louis to Chicago Route2
17071     Joplin to Kansas City Route 2
17072     Springfield to KC Via Hanibal
17073              Saint Louis to Tulsa
17074        Wichita to Little Rock.kml
Name: routeName, dtype: object
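
The attribute and bracket notations used above return the same column, but they are not always interchangeable. The following sketch uses a hypothetical dataframe whose column names were chosen to show two cases where only brackets work: a name containing a space and a name that clashes with an existing dataframe attribute.

```python
import pandas as pd

# Hypothetical toy dataframe; "route id" and "shape" are illustrative names.
toy = pd.DataFrame(
    {
        "routeName": ["Saint Louis to Tulsa", "Joplin to Kansas City"],
        "route id": [1390372503, 1384345811],
        "shape": ["line", "loop"],
    }
)

# Both notations give the same column when the name is a valid identifier.
same_column = toy.routeName.equals(toy["routeName"])

# A name with a space can only be reached with brackets.
route_ids = list(toy["route id"])

# toy.shape is the (rows, columns) attribute, not the "shape" column.
dimensions = toy.shape
shape_column = list(toy["shape"])

print(same_column, dimensions, shape_column)
```

For this reason, bracket notation is the safer default in scripts, while attribute notation remains convenient for interactive exploration.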
[21]:
#
# Display the dataframe
#
truck_events
[21]:
       driverId  truckId eventTime eventType  longitude  latitude  \
0            14       25   59:21.4    Normal     -94.58     37.03
1            18       16   59:21.7    Normal     -89.66     39.78
2            27      105   59:21.7    Normal     -90.21     38.65
3            11       74   59:21.7    Normal     -90.20     38.65
4            22       87   59:21.7    Normal     -90.04     35.19
...         ...      ...       ...       ...        ...       ...
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
0       14|25|9223370572464814373   3.660000e+18         Adis Cesir
1       18|16|9223370572464814089   3.660000e+18          Grant Liu
2      27|105|9223370572464814070   3.660000e+18    Mark Lochbihler
3       11|74|9223370572464814123   3.660000e+18     Jamie Engesser
4       22|87|9223370572464814101   3.660000e+18      Nadeem Asghar
...                           ...            ...                ...
17070   11|27|9223370571956432681   1.000000e+03     Jamie Engesser
17071   16|46|9223370571956431821   1.000000e+03         Tom McCuch
17072   18|49|9223370571956432141   1.000000e+03          Grant Liu
17073   10|39|9223370571956431961   1.000000e+03  George Vetticaden
17074  19|100|9223370571956431810   1.000000e+03         Ajay Singh

          routeId                               routeName      eventDate
0       160405074           Joplin to Kansas City Route 2  2016-05-27-22
1      1565885487           Springfield to KC Via Hanibal  2016-05-27-22
2      1325562373  Springfield to KC Via Columbia Route 2  2016-05-27-22
3      1567254452           Saint Louis to Memphis Route2  2016-05-27-22
4      1198242881           Saint Louis to Chicago Route2  2016-05-27-22
...           ...                                     ...            ...
17070  1198242881           Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074           Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487           Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503                    Saint Louis to Tulsa  2016-06-02-20
17074  1962261785              Wichita to Little Rock.kml  2016-06-02-20

[17075 rows x 12 columns]
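
The output above is truncated because the dataframe has more rows than the `display.max_rows` option allows. As a sketch on a hypothetical 100-row dataframe, `pd.option_context()` can change that limit temporarily without affecting the rest of the session.

```python
import pandas as pd

# Hypothetical toy dataframe with 100 rows.
toy = pd.DataFrame({"row": range(100)})

# With a limit of 10 rows, the text representation is truncated and
# ends with a summary of the dimensions.
with pd.option_context("display.max_rows", 10):
    truncated = repr(toy)

# With no limit, every row is rendered.
with pd.option_context("display.max_rows", None):
    full = repr(toy)

print(len(truncated.splitlines()), len(full.splitlines()))
```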