Viewing the structure of a dataframe
10:54 min | Last modified: October 6, 2021
This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/
Pandas is a high-performance Python library for data manipulation and analysis. It is widely used in analytics and data science, so mastering it is essential. Pandas specializes in "tidy" structures, that is, data tables where each row is a record and each column is an attribute.
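To make the "tidy" layout concrete, here is a minimal sketch built by hand. The frame `events` and its three rows are hypothetical toy data, not the dataset loaded later; the column names simply mirror some of the real ones.

```python
import pandas as pd

# Hypothetical toy data: each row is one record (an event),
# each column is one attribute of that record.
events = pd.DataFrame(
    {
        "driverId": [14, 18, 27],
        "eventType": ["Normal", "Normal", "Normal"],
        "latitude": [37.03, 39.78, 38.65],
    }
)

# shape reports (number of records, number of attributes).
n_rows, n_cols = events.shape
print(n_rows, n_cols)
```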
[1]:
#
# Setup
#
import pandas as pd
import numpy as np
pd.set_option("display.notebook_repr_html", False)
[2]:
#
# Load the file from a GitHub repo
#
truck_events = pd.read_csv(
    "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv",
    sep=",",
    thousands=None,
    decimal=".",
)
#
# Total number of records read.
#
len(truck_events)
[2]:
17075
[3]:
#
# Size (rows, columns)
#
truck_events.shape
[3]:
(17075, 12)
[4]:
#
# Display the data types of the
# columns.
#
truck_events.dtypes
[4]:
driverId           int64
truckId            int64
eventTime         object
eventType         object
longitude        float64
latitude         float64
eventKey          object
CorrelationId    float64
driverName        object
routeId            int64
routeName         object
eventDate         object
dtype: object
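The `object` dtype reported for `eventDate` means pandas read that column as plain strings. The following is a minimal sketch of converting such strings to timestamps. It assumes the last field of values such as `2016-05-27-22` is the hour of the day, which the source does not state. The two sample values are copied from the output shown in this tutorial.

```python
import pandas as pd

# Two sample values copied from the eventDate column.
event_date = pd.Series(["2016-05-27-22", "2016-06-02-20"], name="eventDate")

# Assumption: the format is year-month-day-hour (%H is not confirmed
# by the source; adjust the format string if the field means something else).
parsed = pd.to_datetime(event_date, format="%Y-%m-%d-%H")

# After parsing, the dtype is a datetime64 type instead of object.
print(parsed.dtype)
print(parsed.dt.hour.tolist())
```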
[5]:
#
# Display the column names.
#
truck_events.columns
[5]:
Index(['driverId', 'truckId', 'eventTime', 'eventType', 'longitude',
       'latitude', 'eventKey', 'CorrelationId', 'driverName', 'routeId',
       'routeName', 'eventDate'],
      dtype='object')
[6]:
#
# Columns sorted alphabetically.
#
sorted(truck_events.columns)
[6]:
['CorrelationId',
 'driverId',
 'driverName',
 'eventDate',
 'eventKey',
 'eventTime',
 'eventType',
 'latitude',
 'longitude',
 'routeId',
 'routeName',
 'truckId']
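Note that `sorted()` returns a plain Python list and leaves the dataframe unchanged. As a small sketch on a hypothetical one-row frame (the name `frame` and its values are only for illustration), that list can be used to obtain a copy of the frame with its columns in sorted order:

```python
import pandas as pd

# Hypothetical one-row frame using three of the real column names.
frame = pd.DataFrame(
    {"truckId": [25], "driverId": [14], "CorrelationId": [1000.0]}
)

# sorted() yields a list of names; indexing with it reorders the columns.
ordered_names = sorted(frame.columns)
reordered = frame[ordered_names]

print(list(frame.columns))      # original order is untouched
print(list(reordered.columns))  # sorted order
```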
[7]:
#
# Display the row index.
#
truck_events.index
[7]:
RangeIndex(start=0, stop=17075, step=1)
[8]:
#
# Extract the dataframe values as
# a NumPy array.
#
truck_events.values
[8]:
array([[14, 25, '59:21.4', ..., 160405074,
        'Joplin to Kansas City Route 2', '2016-05-27-22'],
       [18, 16, '59:21.7', ..., 1565885487,
        'Springfield to KC Via Hanibal', '2016-05-27-22'],
       [27, 105, '59:21.7', ..., 1325562373,
        'Springfield to KC Via Columbia Route 2', '2016-05-27-22'],
       ...,
       [18, 49, '12:23.7', ..., 1565885487,
        'Springfield to KC Via Hanibal', '2016-06-02-20'],
       [10, 39, '12:23.8', ..., 1390372503, 'Saint Louis to Tulsa',
        '2016-06-02-20'],
       [19, 100, '12:24.0', ..., 1962261785,
        'Wichita to Little Rock.kml', '2016-06-02-20']], dtype=object)
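The `dtype=object` in the array above is a consequence of mixing numeric and text columns: NumPy needs one dtype for the whole array. The sketch below uses a hypothetical two-row frame to show the effect, and uses `DataFrame.to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical two-row frame mixing integer, text and float columns.
frame = pd.DataFrame(
    {
        "driverId": [14, 18],
        "eventType": ["Normal", "Normal"],
        "latitude": [37.03, 39.78],
    }
)

# Mixed column types fall back to a generic object array.
mixed = frame.to_numpy()

# Selecting only numeric columns gives a numeric array
# (int64 and float64 are combined into float64).
numeric = frame[["driverId", "latitude"]].to_numpy()

print(mixed.dtype, numeric.dtype)
```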
[9]:
#
# Display the first five records
# of the dataframe.
#
truck_events.head()
[9]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
[10]:
#
# Display the first three records of the
# dataframe.
#
truck_events.head(n=3)
[10]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
[11]:
#
# A negative n indicates how many rows to drop
# from the end of the dataframe. This dataframe
# has 17,075 rows, so n=-17070 keeps the first five.
#
truck_events.head(n=-17070)
[11]:
   driverId  truckId eventTime eventType  longitude  latitude  \
0        14       25   59:21.4    Normal     -94.58     37.03
1        18       16   59:21.7    Normal     -89.66     39.78
2        27      105   59:21.7    Normal     -90.21     38.65
3        11       74   59:21.7    Normal     -90.20     38.65
4        22       87   59:21.7    Normal     -90.04     35.19

                     eventKey  CorrelationId       driverName     routeId  \
0   14|25|9223370572464814373   3.660000e+18       Adis Cesir   160405074
1   18|16|9223370572464814089   3.660000e+18        Grant Liu  1565885487
2  27|105|9223370572464814070   3.660000e+18  Mark Lochbihler  1325562373
3   11|74|9223370572464814123   3.660000e+18   Jamie Engesser  1567254452
4   22|87|9223370572464814101   3.660000e+18    Nadeem Asghar  1198242881

                                routeName      eventDate
0           Joplin to Kansas City Route 2  2016-05-27-22
1           Springfield to KC Via Hanibal  2016-05-27-22
2  Springfield to KC Via Columbia Route 2  2016-05-27-22
3           Saint Louis to Memphis Route2  2016-05-27-22
4           Saint Louis to Chicago Route2  2016-05-27-22
[12]:
#
# Display the last five records
# of the dataframe.
#
truck_events.tail()
[12]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17070   11|27|9223370571956432681         1000.0     Jamie Engesser
17071   16|46|9223370571956431821         1000.0         Tom McCuch
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                      routeName      eventDate
17070  1198242881  Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074  Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487  Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503           Saint Louis to Tulsa  2016-06-02-20
17074  1962261785     Wichita to Little Rock.kml  2016-06-02-20
[13]:
#
# Display the last three records of the
# dataframe.
#
truck_events.tail(n=3)
[13]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                      routeName      eventDate
17072  1565885487  Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503           Saint Louis to Tulsa  2016-06-02-20
17074  1962261785     Wichita to Little Rock.kml  2016-06-02-20
[14]:
#
# A negative n indicates how many records to
# drop from the start of the dataframe, so
# n=-17070 keeps the last five.
#
truck_events.tail(n=-17070)
[14]:
       driverId  truckId eventTime eventType  longitude  latitude  \
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
17070   11|27|9223370571956432681         1000.0     Jamie Engesser
17071   16|46|9223370571956431821         1000.0         Tom McCuch
17072   18|49|9223370571956432141         1000.0          Grant Liu
17073   10|39|9223370571956431961         1000.0  George Vetticaden
17074  19|100|9223370571956431810         1000.0         Ajay Singh

          routeId                      routeName      eventDate
17070  1198242881  Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074  Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487  Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503           Saint Louis to Tulsa  2016-06-02-20
17074  1962261785     Wichita to Little Rock.kml  2016-06-02-20
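The negative-`n` behaviour of `head()` and `tail()` may be easier to remember through its slicing equivalents. The following sketch uses a hypothetical ten-row frame (`frame`, with a single `value` column) so that the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical ten-row frame, index 0..9.
frame = pd.DataFrame({"value": range(10)})

k = 7
first_rows = frame.head(n=-k)  # drops the last 7 rows
last_rows = frame.tail(n=-k)   # drops the first 7 rows

# Equivalent positional slices.
print(first_rows.equals(frame.iloc[:-k]))
print(last_rows.equals(frame.iloc[k:]))
```

With 10 rows and `k = 7`, each call keeps `10 - 7 = 3` rows, in the same way that `n=-17070` keeps 5 of the 17,075 rows above.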
[15]:
#
# Display an intermediate portion of the
# dataframe by combining head() and tail().
#
truck_events.head(n=20).tail(n=10)
[15]:
    driverId  truckId eventTime eventType  longitude  latitude  \
10        27      105   59:22.6    Normal     -90.41     38.75
11        17       15   59:23.2    Normal     -90.55     38.81
12        14       25   59:23.3    Normal     -94.31     37.31
13        28       39   59:23.3    Normal     -89.96     39.74
14        15       51   59:23.4    Normal     -90.68     35.12
15        16       12   59:23.4    Normal     -90.29     40.96
16        31       18   59:23.5    Normal     -94.31     37.31
17        25       96   59:23.5    Normal     -90.24     38.00
18        14       25   59:24.2    Normal     -94.30     37.66
19        22       87   59:24.2    Normal     -90.94     35.03

                      eventKey  CorrelationId            driverName  \
10  27|105|9223370572464813205   3.660000e+18       Mark Lochbihler
11   17|15|9223370572464812585   3.660000e+18           Eric Mizell
12   14|25|9223370572464812526   3.660000e+18            Adis Cesir
13   28|39|9223370572464812496   3.660000e+18       Olivier Renault
14   15|51|9223370572464812405   3.660000e+18          Rohit Bakshi
15   16|12|9223370572464812395   3.660000e+18            Tom McCuch
16   31|18|9223370572464812346   3.660000e+18         Rommel Garcia
17   25|96|9223370572464812336   3.660000e+18  Jean-Philippe Player
18   14|25|9223370572464811655   3.660000e+18            Adis Cesir
19   22|87|9223370572464811652   3.660000e+18         Nadeem Asghar

       routeId                               routeName      eventDate
10  1325562373  Springfield to KC Via Columbia Route 2  2016-05-27-22
11  1927624662          Springfield to KC Via Columbia  2016-05-27-22
12   160405074           Joplin to Kansas City Route 2  2016-05-27-22
13   137128276   Springfield to KC Via Hanibal Route 2  2016-05-27-22
14  1384345811                   Joplin to Kansas City  2016-05-27-22
15  1961634315                  Saint Louis to Memphis  2016-05-27-22
16  1594289134          Memphis to Little Rock Route 2  2016-05-27-22
17   371182829                  Memphis to Little Rock  2016-05-27-22
18   160405074           Joplin to Kansas City Route 2  2016-05-27-22
19  1198242881           Saint Louis to Chicago Route2  2016-05-27-22
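Chaining `head(n=20).tail(n=10)` selects rows 10 through 19. As an alternative sketch, the same window can be taken in one step with a positional slice. The 100-row `frame` below is hypothetical and only serves to compare both approaches.

```python
import pandas as pd

# Hypothetical 100-row frame, index 0..99.
frame = pd.DataFrame({"value": range(100)})

# The approach used in the cell above.
middle = frame.head(n=20).tail(n=10)

# Positional slice: start is inclusive, stop is exclusive.
same = frame.iloc[10:20]

print(middle.equals(same))
```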
[16]:
#
# Display the first five records
# of a column (attribute notation).
#
truck_events.routeName.head()
[16]:
0             Joplin to Kansas City Route 2
1             Springfield to KC Via Hanibal
2    Springfield to KC Via Columbia Route 2
3             Saint Louis to Memphis Route2
4             Saint Louis to Chicago Route2
Name: routeName, dtype: object
[17]:
#
# Display the first three records
# of a column (bracket notation).
#
truck_events['routeName'].head(n=3)
[17]:
0             Joplin to Kansas City Route 2
1             Springfield to KC Via Hanibal
2    Springfield to KC Via Columbia Route 2
Name: routeName, dtype: object
[18]:
#
# Display the last five records
# of a column (attribute notation).
#
truck_events.routeName.tail()
[18]:
17070    Saint Louis to Chicago Route2
17071    Joplin to Kansas City Route 2
17072    Springfield to KC Via Hanibal
17073             Saint Louis to Tulsa
17074       Wichita to Little Rock.kml
Name: routeName, dtype: object
[19]:
#
# Display the last five records
# of a column (bracket notation).
#
truck_events['routeName'].tail()
[19]:
17070    Saint Louis to Chicago Route2
17071    Joplin to Kansas City Route 2
17072    Springfield to KC Via Hanibal
17073             Saint Louis to Tulsa
17074       Wichita to Little Rock.kml
Name: routeName, dtype: object
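The cells above use both `truck_events.routeName` and `truck_events['routeName']`. They return the same column, but the two notations are not always interchangeable. The sketch below uses a hypothetical frame with a column deliberately named `shape` (not present in the real dataset) to show a case where only bracket notation reaches the column.

```python
import pandas as pd

# Hypothetical frame; the "shape" column is invented to force a name clash.
frame = pd.DataFrame(
    {
        "routeName": ["Saint Louis to Tulsa"],
        "shape": ["circle"],
    }
)

# Both notations return the same Series when the name is a valid
# identifier and does not clash with an existing attribute.
print(frame.routeName.equals(frame["routeName"]))

# "shape" clashes with DataFrame.shape: attribute access returns the
# (rows, columns) tuple, while brackets return the column.
print(frame.shape)
print(frame["shape"].tolist())
```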
[21]:
#
# Display the dataframe
#
truck_events
[21]:
       driverId  truckId eventTime eventType  longitude  latitude  \
0            14       25   59:21.4    Normal     -94.58     37.03
1            18       16   59:21.7    Normal     -89.66     39.78
2            27      105   59:21.7    Normal     -90.21     38.65
3            11       74   59:21.7    Normal     -90.20     38.65
4            22       87   59:21.7    Normal     -90.04     35.19
...         ...      ...       ...       ...        ...       ...
17070        11       27   12:23.1    Normal     -90.20     38.65
17071        16       46   12:24.0    Normal     -94.35     38.33
17072        18       49   12:23.7    Normal     -90.52     39.71
17073        10       39   12:23.8    Normal     -93.34     37.21
17074        19      100   12:24.0    Normal     -97.37     36.79

                         eventKey  CorrelationId         driverName  \
0       14|25|9223370572464814373   3.660000e+18         Adis Cesir
1       18|16|9223370572464814089   3.660000e+18          Grant Liu
2      27|105|9223370572464814070   3.660000e+18    Mark Lochbihler
3       11|74|9223370572464814123   3.660000e+18     Jamie Engesser
4       22|87|9223370572464814101   3.660000e+18      Nadeem Asghar
...                           ...            ...                ...
17070   11|27|9223370571956432681   1.000000e+03     Jamie Engesser
17071   16|46|9223370571956431821   1.000000e+03         Tom McCuch
17072   18|49|9223370571956432141   1.000000e+03          Grant Liu
17073   10|39|9223370571956431961   1.000000e+03  George Vetticaden
17074  19|100|9223370571956431810   1.000000e+03         Ajay Singh

          routeId                               routeName      eventDate
0       160405074           Joplin to Kansas City Route 2  2016-05-27-22
1      1565885487           Springfield to KC Via Hanibal  2016-05-27-22
2      1325562373  Springfield to KC Via Columbia Route 2  2016-05-27-22
3      1567254452           Saint Louis to Memphis Route2  2016-05-27-22
4      1198242881           Saint Louis to Chicago Route2  2016-05-27-22
...           ...                                     ...            ...
17070  1198242881           Saint Louis to Chicago Route2  2016-06-02-20
17071   160405074           Joplin to Kansas City Route 2  2016-06-02-20
17072  1565885487           Springfield to KC Via Hanibal  2016-06-02-20
17073  1390372503                    Saint Louis to Tulsa  2016-06-02-20
17074  1962261785              Wichita to Little Rock.kml  2016-06-02-20

[17075 rows x 12 columns]
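The `...` rows in the output above appear because pandas truncates long dataframes when printing them. As a rough sketch, the limits can be changed temporarily with `pd.option_context` and the `display.max_rows` and `display.min_rows` options. The 100-row `frame` and the chosen limits are hypothetical.

```python
import pandas as pd

# Hypothetical 100-row frame.
frame = pd.DataFrame({"value": range(100)})

# Inside the context manager, frames longer than 6 rows are truncated
# to 6 visible rows with an ellipsis row in the middle.
with pd.option_context("display.max_rows", 6, "display.min_rows", 6):
    short_text = repr(frame)

# max_rows=None removes the row limit, so every row is printed.
with pd.option_context("display.max_rows", None):
    full_text = repr(frame)

print(short_text)
```

The options revert to their previous values when each `with` block ends, so the rest of the notebook is unaffected.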