{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Viewing the structure of the dataframe --- 10:54 min\n", "\n", "* 10:54 min | Last modified: October 6, 2021 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial is based on https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is a high-performance Python library for data manipulation and analysis. It is widely used in analytics and data science, so mastering it is essential. Pandas specializes in \"tidy\" structures, that is, data tables in which each row is a record and each column is an attribute." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "#\n", "# Setup\n", "#\n", "import pandas as pd\n", "import numpy as np\n", "\n", "pd.set_option(\"display.notebook_repr_html\", False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17075" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Load the file from a GitHub repo\n", "#\n", "truck_events = pd.read_csv(\n", " \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv\",\n", " sep=\",\", \n", " thousands=None, \n", " decimal=\".\",\n", ")\n", "\n", "#\n", "# Total number of records read.\n", "#\n", "len(truck_events)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(17075, 12)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Size (rows, columns)\n", "#\n", "truck_events.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "driverId int64\n", "truckId int64\n", "eventTime object\n", "eventType 
object\n", "longitude float64\n", "latitude float64\n", "eventKey object\n", "CorrelationId float64\n", "driverName object\n", "routeId int64\n", "routeName object\n", "eventDate object\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the data types of the columns.\n", "#\n", "truck_events.dtypes" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['driverId', 'truckId', 'eventTime', 'eventType', 'longitude',\n", " 'latitude', 'eventKey', 'CorrelationId', 'driverName', 'routeId',\n", " 'routeName', 'eventDate'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the column names.\n", "#\n", "truck_events.columns" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['CorrelationId',\n", " 'driverId',\n", " 'driverName',\n", " 'eventDate',\n", " 'eventKey',\n", " 'eventTime',\n", " 'eventType',\n", " 'latitude',\n", " 'longitude',\n", " 'routeId',\n", " 'routeName',\n", " 'truckId']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Column names sorted alphabetically. The sort is\n", "# case-sensitive, so uppercase names come first.\n", "#\n", "sorted(truck_events.columns)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=17075, step=1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the row index.\n", "#\n", "truck_events.index" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[14, 25, '59:21.4', ..., 160405074,\n", " 'Joplin to Kansas City Route 2', '2016-05-27-22'],\n", " [18, 16, '59:21.7', ..., 1565885487,\n", " 'Springfield to KC Via 
Hanibal', '2016-05-27-22'],\n", " [27, 105, '59:21.7', ..., 1325562373,\n", " 'Springfield to KC Via Columbia Route 2', '2016-05-27-22'],\n", " ...,\n", " [18, 49, '12:23.7', ..., 1565885487,\n", " 'Springfield to KC Via Hanibal', '2016-06-02-20'],\n", " [10, 39, '12:23.8', ..., 1390372503, 'Saint Louis to Tulsa',\n", " '2016-06-02-20'],\n", " [19, 100, '12:24.0', ..., 1962261785,\n", " 'Wichita to Little Rock.kml', '2016-06-02-20']], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Extract the dataframe values as a NumPy\n", "# array.\n", "#\n", "truck_events.values" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "3 11 74 59:21.7 Normal -90.20 38.65 \n", "4 22 87 59:21.7 Normal -90.04 35.19 \n", "\n", " eventKey CorrelationId driverName routeId \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074 \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487 \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373 \n", "3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452 \n", "4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881 \n", "\n", " routeName eventDate \n", "0 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "3 Saint Louis to Memphis Route2 2016-05-27-22 \n", "4 Saint Louis to Chicago Route2 2016-05-27-22 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the first five records of the\n", "# dataframe.\n", "#\n", "truck_events.head()" ] }, { "cell_type": "code", 
"execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "\n", " eventKey CorrelationId driverName routeId \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074 \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487 \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373 \n", "\n", " routeName eventDate \n", "0 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 Springfield to KC Via Columbia Route 2 2016-05-27-22 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the first three records of the\n", "# dataframe.\n", "#\n", "truck_events.head(n=3)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "3 11 74 59:21.7 Normal -90.20 38.65 \n", "4 22 87 59:21.7 Normal -90.04 35.19 \n", "\n", " eventKey CorrelationId driverName routeId \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074 \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487 \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373 \n", "3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452 \n", "4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881 \n", "\n", " routeName eventDate \n", "0 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "3 Saint Louis to Memphis Route2 2016-05-27-22 \n", "4 Saint 
Louis to Chicago Route2 2016-05-27-22 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# A negative n returns all rows except the last\n", "# |n| rows. This dataframe has 17,075 rows, so\n", "# n=-17070 keeps only the first five.\n", "#\n", "truck_events.head(n=-17070)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "17070 11 27 12:23.1 Normal -90.20 38.65 \n", "17071 16 46 12:24.0 Normal -94.35 38.33 \n", "17072 18 49 12:23.7 Normal -90.52 39.71 \n", "17073 10 39 12:23.8 Normal -93.34 37.21 \n", "17074 19 100 12:24.0 Normal -97.37 36.79 \n", "\n", " eventKey CorrelationId driverName \\\n", "17070 11|27|9223370571956432681 1000.0 Jamie Engesser \n", "17071 16|46|9223370571956431821 1000.0 Tom McCuch \n", "17072 18|49|9223370571956432141 1000.0 Grant Liu \n", "17073 10|39|9223370571956431961 1000.0 George Vetticaden \n", "17074 19|100|9223370571956431810 1000.0 Ajay Singh \n", "\n", " routeId routeName eventDate \n", "17070 1198242881 Saint Louis to Chicago Route2 2016-06-02-20 \n", "17071 160405074 Joplin to Kansas City Route 2 2016-06-02-20 \n", "17072 1565885487 Springfield to KC Via Hanibal 2016-06-02-20 \n", "17073 1390372503 Saint Louis to Tulsa 2016-06-02-20 \n", "17074 1962261785 Wichita to Little Rock.kml 2016-06-02-20 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the last five records of the\n", "# dataframe.\n", "#\n", "truck_events.tail()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "17072 18 49 12:23.7 Normal -90.52 39.71 \n", "17073 10 39 12:23.8 Normal -93.34 37.21 \n", "17074 19 100 12:24.0 Normal -97.37 36.79 \n", "\n", " eventKey CorrelationId driverName 
\\\n", "17072 18|49|9223370571956432141 1000.0 Grant Liu \n", "17073 10|39|9223370571956431961 1000.0 George Vetticaden \n", "17074 19|100|9223370571956431810 1000.0 Ajay Singh \n", "\n", " routeId routeName eventDate \n", "17072 1565885487 Springfield to KC Via Hanibal 2016-06-02-20 \n", "17073 1390372503 Saint Louis to Tulsa 2016-06-02-20 \n", "17074 1962261785 Wichita to Little Rock.kml 2016-06-02-20 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the last three records of the\n", "# dataframe.\n", "#\n", "truck_events.tail(n=3)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "17070 11 27 12:23.1 Normal -90.20 38.65 \n", "17071 16 46 12:24.0 Normal -94.35 38.33 \n", "17072 18 49 12:23.7 Normal -90.52 39.71 \n", "17073 10 39 12:23.8 Normal -93.34 37.21 \n", "17074 19 100 12:24.0 Normal -97.37 36.79 \n", "\n", " eventKey CorrelationId driverName \\\n", "17070 11|27|9223370571956432681 1000.0 Jamie Engesser \n", "17071 16|46|9223370571956431821 1000.0 Tom McCuch \n", "17072 18|49|9223370571956432141 1000.0 Grant Liu \n", "17073 10|39|9223370571956431961 1000.0 George Vetticaden \n", "17074 19|100|9223370571956431810 1000.0 Ajay Singh \n", "\n", " routeId routeName eventDate \n", "17070 1198242881 Saint Louis to Chicago Route2 2016-06-02-20 \n", "17071 160405074 Joplin to Kansas City Route 2 2016-06-02-20 \n", "17072 1565885487 Springfield to KC Via Hanibal 2016-06-02-20 \n", "17073 1390372503 Saint Louis to Tulsa 2016-06-02-20 \n", "17074 1962261785 Wichita to Little Rock.kml 2016-06-02-20 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# A negative n returns all rows except the first\n", "# |n| rows, so n=-17070 keeps only the last five.\n", "#\n", "truck_events.tail(n=-17070)" ] }, { "cell_type": 
"code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "10 27 105 59:22.6 Normal -90.41 38.75 \n", "11 17 15 59:23.2 Normal -90.55 38.81 \n", "12 14 25 59:23.3 Normal -94.31 37.31 \n", "13 28 39 59:23.3 Normal -89.96 39.74 \n", "14 15 51 59:23.4 Normal -90.68 35.12 \n", "15 16 12 59:23.4 Normal -90.29 40.96 \n", "16 31 18 59:23.5 Normal -94.31 37.31 \n", "17 25 96 59:23.5 Normal -90.24 38.00 \n", "18 14 25 59:24.2 Normal -94.30 37.66 \n", "19 22 87 59:24.2 Normal -90.94 35.03 \n", "\n", " eventKey CorrelationId driverName \\\n", "10 27|105|9223370572464813205 3.660000e+18 Mark Lochbihler \n", "11 17|15|9223370572464812585 3.660000e+18 Eric Mizell \n", "12 14|25|9223370572464812526 3.660000e+18 Adis Cesir \n", "13 28|39|9223370572464812496 3.660000e+18 Olivier Renault \n", "14 15|51|9223370572464812405 3.660000e+18 Rohit Bakshi \n", "15 16|12|9223370572464812395 3.660000e+18 Tom McCuch \n", "16 31|18|9223370572464812346 3.660000e+18 Rommel Garcia \n", "17 25|96|9223370572464812336 3.660000e+18 Jean-Philippe Player \n", "18 14|25|9223370572464811655 3.660000e+18 Adis Cesir \n", "19 22|87|9223370572464811652 3.660000e+18 Nadeem Asghar \n", "\n", " routeId routeName eventDate \n", "10 1325562373 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "11 1927624662 Springfield to KC Via Columbia 2016-05-27-22 \n", "12 160405074 Joplin to Kansas City Route 2 2016-05-27-22 \n", "13 137128276 Springfield to KC Via Hanibal Route 2 2016-05-27-22 \n", "14 1384345811 Joplin to Kansas City 2016-05-27-22 \n", "15 1961634315 Saint Louis to Memphis 2016-05-27-22 \n", "16 1594289134 Memphis to Little Rock Route 2 2016-05-27-22 \n", "17 371182829 Memphis to Little Rock 2016-05-27-22 \n", "18 160405074 Joplin to Kansas City Route 2 2016-05-27-22 \n", "19 1198242881 Saint Louis to Chicago Route2 2016-05-27-22 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" 
} ], "source": [ "#\n", "# Display a middle portion of the dataframe by\n", "# combining head() and tail().\n", "#\n", "truck_events.head(n=20).tail(n=10)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Joplin to Kansas City Route 2\n", "1 Springfield to KC Via Hanibal\n", "2 Springfield to KC Via Columbia Route 2\n", "3 Saint Louis to Memphis Route2\n", "4 Saint Louis to Chicago Route2\n", "Name: routeName, dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the first five records of a column\n", "# (attribute access).\n", "#\n", "truck_events.routeName.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Joplin to Kansas City Route 2\n", "1 Springfield to KC Via Hanibal\n", "2 Springfield to KC Via Columbia Route 2\n", "Name: routeName, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the first three records of a column\n", "# (bracket access).\n", "#\n", "truck_events['routeName'].head(n=3)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17070 Saint Louis to Chicago Route2\n", "17071 Joplin to Kansas City Route 2\n", "17072 Springfield to KC Via Hanibal\n", "17073 Saint Louis to Tulsa\n", "17074 Wichita to Little Rock.kml\n", "Name: routeName, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the last five records of a column\n", "# (attribute access).\n", "#\n", "truck_events.routeName.tail()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17070 Saint Louis to Chicago Route2\n", "17071 Joplin to Kansas City Route 2\n", "17072 Springfield to KC Via Hanibal\n", "17073 Saint Louis to 
Tulsa\n", "17074 Wichita to Little Rock.kml\n", "Name: routeName, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the last five records of a column\n", "# (bracket access).\n", "#\n", "truck_events['routeName'].tail()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "3 11 74 59:21.7 Normal -90.20 38.65 \n", "4 22 87 59:21.7 Normal -90.04 35.19 \n", "... ... ... ... ... ... ... \n", "17070 11 27 12:23.1 Normal -90.20 38.65 \n", "17071 16 46 12:24.0 Normal -94.35 38.33 \n", "17072 18 49 12:23.7 Normal -90.52 39.71 \n", "17073 10 39 12:23.8 Normal -93.34 37.21 \n", "17074 19 100 12:24.0 Normal -97.37 36.79 \n", "\n", " eventKey CorrelationId driverName \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler \n", "3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser \n", "4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar \n", "... ... ... ... 
\n", "17070 11|27|9223370571956432681 1.000000e+03 Jamie Engesser \n", "17071 16|46|9223370571956431821 1.000000e+03 Tom McCuch \n", "17072 18|49|9223370571956432141 1.000000e+03 Grant Liu \n", "17073 10|39|9223370571956431961 1.000000e+03 George Vetticaden \n", "17074 19|100|9223370571956431810 1.000000e+03 Ajay Singh \n", "\n", " routeId routeName eventDate \n", "0 160405074 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 1565885487 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 1325562373 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "3 1567254452 Saint Louis to Memphis Route2 2016-05-27-22 \n", "4 1198242881 Saint Louis to Chicago Route2 2016-05-27-22 \n", "... ... ... ... \n", "17070 1198242881 Saint Louis to Chicago Route2 2016-06-02-20 \n", "17071 160405074 Joplin to Kansas City Route 2 2016-06-02-20 \n", "17072 1565885487 Springfield to KC Via Hanibal 2016-06-02-20 \n", "17073 1390372503 Saint Louis to Tulsa 2016-06-02-20 \n", "17074 1962261785 Wichita to Little Rock.kml 2016-06-02-20 \n", "\n", "[17075 rows x 12 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Display the dataframe\n", "#\n", "truck_events" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc-autonumbering": false, "toc-showmarkdowntxt": false }, "nbformat": 4, "nbformat_minor": 4 }