{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Caso práctico --- Exportación de un subconjunto de datos --- 6:29 min\n", "\n", "* 6:29 min | Última modificación: Octubre 6, 2021 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Este tutorial esta basado en https://es.hortonworks.com/tutorial/beginners-guide-to-apache-pig/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparación" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.notebook_repr_html\", False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Carga de los datos de los eventos de los conductores" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "3 11 74 59:21.7 Normal -90.20 38.65 \n", "4 22 87 59:21.7 Normal -90.04 35.19 \n", "\n", " eventKey CorrelationId driverName routeId \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074 \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487 \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373 \n", "3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452 \n", "4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881 \n", "\n", " routeName eventDate \n", "0 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "3 Saint Louis to Memphis Route2 2016-05-27-22 \n", "4 Saint Louis to Chicago Route2 2016-05-27-22 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Carga del archivo ddesde un repo en GitHub\n", "#\n", "truck_events = pd.read_csv(\n", " \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv\",\n", ")\n", "\n", "#\n", "# Cabecera del archivo\n", "#\n", "truck_events.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtención de un subconjunto de registros" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En hive es posible un subconjunto de datos y almacenarlo en una nueva tabla a partir de una consulta que permita obtener los datos deseados. En el siguiente código, se crea la tabla `truck_events_subset` con los primeros 100 registros de la tabla `truck_events`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId truckId eventTime eventType longitude latitude \\\n", "0 14 25 59:21.4 Normal -94.58 37.03 \n", "1 18 16 59:21.7 Normal -89.66 39.78 \n", "2 27 105 59:21.7 Normal -90.21 38.65 \n", "3 11 74 59:21.7 Normal -90.20 38.65 \n", "4 22 87 59:21.7 Normal -90.04 35.19 \n", "5 22 87 59:22.3 Normal -90.37 35.21 \n", "6 23 68 59:22.4 Normal -89.91 40.86 \n", "7 11 74 59:22.5 Normal -89.74 39.10 \n", "8 20 41 59:22.5 Normal -93.36 41.69 \n", "9 32 42 59:22.5 Normal -90.37 35.21 \n", "\n", " eventKey CorrelationId driverName routeId \\\n", "0 14|25|9223370572464814373 3.660000e+18 Adis Cesir 160405074 \n", "1 18|16|9223370572464814089 3.660000e+18 Grant Liu 1565885487 \n", "2 27|105|9223370572464814070 3.660000e+18 Mark Lochbihler 1325562373 \n", "3 11|74|9223370572464814123 3.660000e+18 Jamie Engesser 1567254452 \n", "4 22|87|9223370572464814101 3.660000e+18 Nadeem Asghar 1198242881 \n", "5 22|87|9223370572464813486 3.660000e+18 Nadeem Asghar 1198242881 \n", "6 23|68|9223370572464813450 3.660000e+18 Adam Diaz 160405074 \n", "7 11|74|9223370572464813355 3.660000e+18 Jamie Engesser 1567254452 \n", "8 20|41|9223370572464813344 3.660000e+18 Chris Harris 160779139 \n", "9 32|42|9223370572464813296 3.660000e+18 Ryan Templeton 1090292248 \n", "\n", " routeName eventDate \n", "0 Joplin to Kansas City Route 2 2016-05-27-22 \n", "1 Springfield to KC Via Hanibal 2016-05-27-22 \n", "2 Springfield to KC Via Columbia Route 2 2016-05-27-22 \n", "3 Saint Louis to Memphis Route2 2016-05-27-22 \n", "4 Saint Louis to Chicago Route2 2016-05-27-22 \n", "5 Saint Louis to Chicago Route2 2016-05-27-22 \n", "6 Joplin to Kansas City Route 2 2016-05-27-22 \n", "7 Saint Louis to Memphis Route2 2016-05-27-22 \n", "8 Des Moines to Chicago Route 2 2016-05-27-22 \n", "9 Peoria to Ceder Rapids Route 2 2016-05-27-22 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truck_events_subset = truck_events[0:10]\n", "truck_events_subset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtención de un subconjunto de columnas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "En el siguiente código se obtienen algunas columnas de la tabla `truck_events_subset` para ser almacenadas en una tabla diferente." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId eventTime eventType\n", "0 14 59:21.4 Normal\n", "1 18 59:21.7 Normal\n", "2 27 59:21.7 Normal\n", "3 11 59:21.7 Normal\n", "4 22 59:21.7 Normal\n", "5 22 59:22.3 Normal\n", "6 23 59:22.4 Normal\n", "7 11 59:22.5 Normal\n", "8 20 59:22.5 Normal\n", "9 32 59:22.5 Normal" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "specific_columns = truck_events_subset[[\"driverId\", \"eventTime\", \"eventType\"]]\n", "specific_columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtención de un subconjunto de filas y columnas" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " driverId eventTime eventType\n", "0 14 59:21.4 Normal\n", "1 18 59:21.7 Normal\n", "2 27 59:21.7 Normal\n", "3 11 59:21.7 Normal\n", "4 22 59:21.7 Normal\n", "5 22 59:22.3 Normal\n", "6 23 59:22.4 Normal\n", "7 11 59:22.5 Normal\n", "8 20 59:22.5 Normal\n", "9 32 59:22.5 Normal\n", "10 27 59:22.6 Normal" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_sub_set = truck_events.loc[0:10, [\"driverId\", \"eventTime\", \"eventType\"]]\n", "new_sub_set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtención de un campo de un registro en particular" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "driverId 18\n", "truckId 16\n", "eventTime 59:21.7\n", "eventType Normal\n", "longitude -89.66\n", "latitude 39.78\n", "eventKey 18|16|9223370572464814089\n", "CorrelationId 3.66e+18\n", "driverName Grant Liu\n", "routeId 1565885487\n", "routeName Springfield to KC Via Hanibal\n", "eventDate 2016-05-27-22\n", "Name: 1, dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truck_events.iloc[1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'18|16|9223370572464814089'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truck_events.iloc[1].eventKey" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'18|16|9223370572464814089'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "truck_events.iloc[1][\"eventKey\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Escritura de la tabla en el disco" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "specific_columns.to_csv(\"/tmp/specific-columns.csv\", sep=\",\", header=True, index=False)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "specific-columns.csv\n" ] } ], "source": [ "#\n", "# Se visualiza el contenido del directorio\n", "#\n", "!ls /tmp/" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "driverId,eventTime,eventType\n", "14,59:21.4,Normal\n", "18,59:21.7,Normal\n", "27,59:21.7,Normal\n", "11,59:21.7,Normal\n", "22,59:21.7,Normal\n", "22,59:22.3,Normal\n", "23,59:22.4,Normal\n", "11,59:22.5,Normal\n", "20,59:22.5,Normal\n", "32,59:22.5,Normal\n" ] } ], "source": [ "#\n", "# Se visualiza la parte final del archivo\n", "#\n", "!cat /tmp/specific-columns.csv" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.7 ('.venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "toc-autonumbering": false, "toc-showmarkdowntxt": false, "vscode": { "interpreter": { "hash": "99683f87c1354598af39114791cc78b29c5617e49c4bfa4c5fb7492534eeeb87" } } }, "nbformat": 4, "nbformat_minor": 4 }