{ "cells": [ { "cell_type": "markdown", "id": "ef69d5b2-524f-427a-8027-c4b46e3ef676", "metadata": { "tags": [] }, "source": [ "Operaciones básicas sobre archivos de datos usando Python --- 20:33 min\n", "===\n", "\n", "* 20:33 min | Última modificación: Octubre 12, 2021 " ] }, { "cell_type": "markdown", "id": "a1ebebe9-c066-4040-a039-d0d6e0538392", "metadata": {}, "source": [ "En este video se presenta un ejemplo práctico de la manipulación de archivos de datos en texto usando listas. Si bien, resulta mucho más facil realizar esta tarea usando librerías especializaxad en Python, el objetivo aquí es ejemplificar el uso de las estructuras de datos disponibles en Python puro." ] }, { "cell_type": "markdown", "id": "a04d1602-d314-4be0-8b11-41779af298bf", "metadata": {}, "source": [ "Al finalizar este tutorial, usted estará en capacidad de:\n", "\n", "* Descargar archivos de datos de internet y convertirlos a una lista de campos.\n", "\n", "* Visualizar los datos como una tabla.\n", "\n", "* Reorganizar las columnas de la tabla.\n", "\n", "* Obtener un subconjunto de registros.\n", "\n", "* Filtrar los registros.\n", "\n", "* Buscar registros.\n", "\n", "* Obtener los valores únicos por campo.\n", "\n", "* Obtener un subcojunto de las columnas.\n", "\n", "* Escribir los resultados como una tabla a un archivo." ] }, { "cell_type": "markdown", "id": "471bc4fb-c22a-458e-8670-c9eab224dd26", "metadata": { "tags": [] }, "source": [ "## Carga de datos de los eventos de los conductores" ] }, { "cell_type": "code", "execution_count": 1, "id": "afc1c461-6a42-408a-8f21-21428a66c8d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/tmp/data.csv\n", "/tmp/demo.txt\n", "/tmp/truck_event_text_partition.csv\n", "/tmp/tweets.csv\n", "/tmp/tweets.csv.1\n", "/tmp/tweets.csv.2\n", "/tmp/tweets.csv.3\n", "/tmp/tweets.csv.4\n" ] } ], "source": [ "#\n", "# Se descarga el archivo directamente del repo al disco duro.\n", "#\n", "url = \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/truck_event_text_partition.csv\"\n", "!wget --quiet {url} -P /tmp/\n", "\n", "#\n", "# Visualiza el contenido del directorio\n", "#\n", "!ls -1 /tmp/*" ] }, { "cell_type": "code", "execution_count": 2, "id": "86c5af46-4744-4544-a867-6c83f30f6866", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17076" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open(\"/tmp/truck_event_text_partition.csv\", \"r\") as file:\n", " truck_events = file.readlines()\n", "\n", "#\n", "# Cantidad de filas del archivo incluyendo la cabecera.\n", "#\n", "len(truck_events)" ] }, { "cell_type": "code", "execution_count": 3, "id": "95c0df84-dabc-40c0-873b-45c22dd2922e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate\\n',\n", " '14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22\\n']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Visualización del contenido\n", "#\n", "truck_events[0:2]" ] }, { "cell_type": "code", "execution_count": 4, "id": "527d7502-43bc-41c6-bc4a-c609ab06b6f9", "metadata": {}, "outputs": [], "source": [ "#\n", "# Limpieza\n", "#\n", "truck_events = [line.replace(\"\\n\", \"\") for line in truck_events]" ] }, { "cell_type": "code", "execution_count": 5, "id": "24ee9815-1f6e-4983-b733-2b474005845a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['driverId,truckId,eventTime,eventType,longitude,latitude,eventKey,CorrelationId,driverName,routeId,routeName,eventDate',\n", " '14,25,59:21.4,Normal,-94.58,37.03,14|25|9223370572464814373,3.66E+18,Adis Cesir,160405074,Joplin to Kansas City Route 2,2016-05-27-22']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Visualización del contenido\n", "#\n", "truck_events[0:2]" ] }, { "cell_type": "code", "execution_count": 6, "id": "17c16a57-0d2e-4394-ab3c-c7eb37772427", "metadata": {}, "outputs": [], "source": [ "#\n", "# Conversión de los strings a listas\n", "#\n", "truck_events = [line.split(\",\") for line in truck_events]" ] }, { "cell_type": "code", "execution_count": 7, "id": "d58e14cc-7f1e-4f6e-aa6d-e3e38f6030de", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['driverId',\n", " 'truckId',\n", " 'eventTime',\n", " 'eventType',\n", " 'longitude',\n", " 'latitude',\n", " 'eventKey',\n", " 'CorrelationId',\n", " 'driverName',\n", " 'routeId',\n", " 'routeName',\n", " 'eventDate']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# Nombres de las columnas\n", "#\n", "truck_events[0]" ] }, { "cell_type": "markdown", "id": "da440dfe-bb90-4518-9aec-8d23229ae9da", "metadata": {}, "source": [ "## Visualización de datos" ] }, { "cell_type": "code", "execution_count": 8, "id": "ecc7e556-ab61-47ce-945b-5543c32346c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['driverId',\n", " 'truckId',\n", " 'eventTime',\n", " 'eventType',\n", " 'longitude',\n", " 'latitude',\n", " 'eventKey',\n", " 'CorrelationId',\n", " 'driverName',\n", " 'routeId',\n", " 'routeName',\n", " 'eventDate'],\n", " ['14',\n", " '25',\n", " '59:21.4',\n", " 'Normal',\n", " '-94.58',\n", " '37.03',\n", " '14|25|9223370572464814373',\n", " '3.66E+18',\n", " 'Adis Cesir',\n", " '160405074',\n", " 'Joplin to Kansas City Route 2',\n", " '2016-05-27-22']]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "# La visualización de datos como una lista de listas\n", "# resulta inapropiada\n", "#\n", "truck_events[:2]" ] }, { "cell_type": "code", "execution_count": 9, "id": "62c62ea0-2b84-4571-b8cb-fefcc3b15aed", "metadata": {}, "outputs": [], "source": [ "def pprint(data):\n", " #\n", " def get_format_string(data):\n", " format_string = []\n", " for i_col in range(len(data[0])):\n", " lengths = [len(str(row[i_col])) for row in data]\n", " max_length = max(lengths)\n", " #\n", " # Por ejemplo: \"{:>10s}\"\n", " #\n", " format_string.append(\"{:>\" + str(max_length) + \"s}\")\n", " return format_string\n", "\n", " #\n", " def print_data(format_string, data):\n", " format_string = get_format_string(data)\n", " for index, row in enumerate(data):\n", " text = \" \" if index == 0 else \"{:2d} \".format(index - 1)\n", " for fmt, value in zip(format_string, row):\n", " text += fmt.format(str(value)) + \" \"\n", " if len(text) >= 106:\n", " text = text[:100] + \" [...]\"\n", " print(text)\n", "\n", " #\n", " format_string = get_format_string(data)\n", " print_data(format_string, data)" ] }, { "cell_type": "code", "execution_count": 10, "id": "72b74f52-03d6-416a-918a-e418a8e353cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId truckId eventTime eventType longitude latitude eventKey CorrelationId [...]\n", " 0 14 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 3.66E+18 [...]\n", " 1 18 16 59:21.7 Normal -89.66 39.78 18|16|9223370572464814089 3.66E+18 [...]\n", " 2 27 105 59:21.7 Normal -90.21 38.65 27|105|9223370572464814070 3.66E+18 [...]\n", " 3 11 74 59:21.7 Normal -90.2 38.65 11|74|9223370572464814123 3.66E+18 [...]\n" ] } ], "source": [ "pprint(truck_events[:5])" ] }, { "cell_type": "markdown", "id": "ee8fc5cf-24a0-4df6-8b74-b0586134a867", "metadata": {}, "source": [ "## Reorganización de las columnas" ] }, { "cell_type": "code", "execution_count": 11, "id": "4526ec74-c2dc-4b83-ab79-51b7c5dfdf4b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventK [...]\n", " 0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|92233705724648143 [...]\n", " 1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|92233705724648140 [...]\n", " 2 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]\n", " 3 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|92233705724648141 [...]\n", " 4 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|92233705724648141 [...]\n", " 5 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|92233705724648134 [...]\n", " 6 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|92233705724648134 [...]\n", " 7 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|92233705724648133 [...]\n", " 8 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|92233705724648133 [...]\n" ] } ], "source": [ "truck_events = [\n", " [row[0], row[8]] + [row[index] for index in range(1, 12) if index != 8]\n", " for row in truck_events\n", "]\n", "pprint(truck_events[:10])" ] }, { "cell_type": "markdown", "id": "66f6b6a9-36a8-4893-9c5d-592d7f9fa7b5", "metadata": {}, "source": [ "## Obtención de un subconjunto de registros" ] }, { "cell_type": "code", "execution_count": 12, "id": "008ed747-e4fb-4342-8a2e-4309414e73ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventK [...]\n", " 0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|92233705724648143 [...]\n", " 1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|92233705724648140 [...]\n", " 2 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]\n", " 3 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|92233705724648141 [...]\n", " 4 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|92233705724648141 [...]\n", " 5 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|92233705724648134 [...]\n", " 6 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|92233705724648134 [...]\n", " 7 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|92233705724648133 [...]\n", " 8 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|92233705724648133 [...]\n" ] } ], "source": [ "truck_events_subset = truck_events[0:10]\n", "pprint(truck_events_subset)" ] }, { "cell_type": "markdown", "id": "66860ad0-b39c-4680-9ada-5758388000bf", "metadata": { "tags": [] }, "source": [ "## Filtrado de registros" ] }, { "cell_type": "code", "execution_count": 13, "id": "27e503d4-6fee-451a-a949-8e6a6031fd93", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]\n", " 0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 [...]\n", " 1 14 Adis Cesir 25 59:23.3 Normal -94.31 37.31 14|25|9223370572464812526 [...]\n", " 2 14 Adis Cesir 25 59:24.2 Normal -94.3 37.66 14|25|9223370572464811655 [...]\n", " 3 14 Adis Cesir 25 59:34.0 Normal -94.3 37.66 14|25|9223370572464801796 [...]\n", " 4 14 Adis Cesir 25 59:35.8 Normal -94.46 37.16 14|25|9223370572464800006 [...]\n", " 5 14 Adis Cesir 25 59:53.3 Normal -94.58 37.03 14|25|9223370572464782555 [...]\n", " 6 14 Adis Cesir 25 59:54.0 Normal -94.46 37.16 14|25|9223370572464781805 [...]\n", " 7 14 Adis Cesir 25 59:57.5 Normal -94.35 38.33 14|25|9223370572464778335 [...]\n", " 8 14 Adis Cesir 25 59:59.2 Normal -94.38 38.99 14|25|9223370572464776636 [...]\n" ] } ], "source": [ "truck_events_driverId_14 = [truck_events[0]] + [\n", " row for row in truck_events if row[0] == \"14\"\n", "]\n", "pprint(truck_events_driverId_14[:10])" ] }, { "cell_type": "markdown", "id": "7beb3182-3ddc-478b-8499-fb7541fd51f7", "metadata": {}, "source": [ "## Búsqueda de registros usando expresiones regulares" ] }, { "cell_type": "code", "execution_count": 14, "id": "6334cb2e-9616-4cc4-ab2f-2603fe2c5871", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]\n", " 0 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]\n", " 1 23 Adam Diaz 68 59:26.6 Normal -91.32 41.71 23|68|9223370572464809216 [...]\n", " 2 23 Adam Diaz 68 59:27.4 Normal -91.47 41.74 23|68|9223370572464808375 [...]\n", " 3 23 Adam Diaz 68 59:28.4 Normal -91.63 41.72 23|68|9223370572464807444 [...]\n", " 4 23 Adam Diaz 68 59:29.9 Normal -91.78 42.23 23|68|9223370572464805905 [...]\n", " 5 23 Adam Diaz 68 59:30.8 Normal -91.63 41.72 23|68|9223370572464804995 [...]\n", " 6 23 Adam Diaz 68 59:37.0 Normal -89.91 40.86 23|68|9223370572464798855 [...]\n", " 7 23 Adam Diaz 68 59:39.6 Normal -89.91 40.86 23|68|9223370572464796242 [...]\n", " 8 23 Adam Diaz 68 59:40.4 Normal -90.29 40.96 23|68|9223370572464795396 [...]\n" ] } ], "source": [ "import re\n", "\n", "name_finished_in_z = [truck_events[0]] + [\n", " row for row in truck_events if re.search(\"z$\", row[1])\n", "]\n", "pprint(name_finished_in_z[:10])" ] }, { "cell_type": "code", "execution_count": 15, "id": "07090ea8-aaa4-4851-bc3e-dbd7cb783f76", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventK [...]\n", " 0 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]\n", " 1 27 Mark Lochbihler 105 59:22.6 Normal -90.41 38.75 27|105|92233705724648132 [...]\n", " 2 27 Mark Lochbihler 105 59:25.9 Normal -90.93 38.82 27|105|92233705724648098 [...]\n", " 3 27 Mark Lochbihler 105 59:27.7 Normal -91.19 38.83 27|105|92233705724648081 [...]\n", " 4 27 Mark Lochbihler 105 59:29.3 Normal -91.56 38.93 27|105|92233705724648065 [...]\n", " 5 27 Mark Lochbihler 105 59:35.6 Normal -92.85 38.93 27|105|92233705724648001 [...]\n", " 6 27 Mark Lochbihler 105 59:50.9 Normal -93.2 38.98 27|105|92233705724647849 [...]\n", " 7 27 Mark Lochbihler 105 59:51.8 Normal -93.01 38.97 27|105|92233705724647839 [...]\n", " 8 27 Mark Lochbihler 105 59:53.8 Normal -92.63 38.95 27|105|92233705724647820 [...]\n" ] } ], "source": [ "name_begins_with_Ma = [truck_events[0]] + [\n", " row for row in truck_events if re.search(\"^Ma\", row[1])\n", "]\n", "pprint(name_begins_with_Ma[:10])" ] }, { "cell_type": "code", "execution_count": 16, "id": "6f38bfaa-2beb-4739-a219-06502da2e554", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventK [...]\n", " 0 27 Mark Lochbihler 105 59:21.7 Normal -90.21 38.65 27|105|92233705724648140 [...]\n", " 1 27 Mark Lochbihler 105 59:22.6 Normal -90.41 38.75 27|105|92233705724648132 [...]\n", " 2 16 Tom McCuch 12 59:23.4 Normal -90.29 40.96 16|12|92233705724648123 [...]\n", " 3 26 Michael Aube 57 59:25.2 Normal -90.86 38.46 26|57|92233705724648106 [...]\n", " 4 16 Tom McCuch 12 59:25.3 Normal -90.7 41.62 16|12|92233705724648105 [...]\n", " 5 27 Mark Lochbihler 105 59:25.9 Normal -90.93 38.82 27|105|92233705724648098 [...]\n", " 6 26 Michael Aube 57 59:27.0 Normal -91.18 38.22 26|57|92233705724648087 [...]\n", " 7 27 Mark Lochbihler 105 59:27.7 Normal -91.19 38.83 27|105|92233705724648081 [...]\n", " 8 16 Tom McCuch 12 59:28.8 Normal -91.63 41.72 16|12|92233705724648070 [...]\n" ] } ], "source": [ "name_contains_ch = [truck_events[0]] + [\n", " row for row in truck_events if re.search(\"ch\", row[1])\n", "]\n", "pprint(name_contains_ch[:10])" ] }, { "cell_type": "code", "execution_count": 17, "id": "f53679cf-e2a9-4a25-852b-f37785401468", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventKe [...]\n", " 0 22 Nadeem Asghar 87 59:21.7 Normal -90.04 35.19 22|87|922337057246481410 [...]\n", " 1 22 Nadeem Asghar 87 59:22.3 Normal -90.37 35.21 22|87|922337057246481348 [...]\n", " 2 32 Ryan Templeton 42 59:22.5 Normal -90.37 35.21 32|42|922337057246481329 [...]\n", " 3 16 Tom McCuch 12 59:23.4 Normal -90.29 40.96 16|12|922337057246481239 [...]\n", " 4 22 Nadeem Asghar 87 59:24.2 Normal -90.94 35.03 22|87|922337057246481165 [...]\n", " 5 32 Ryan Templeton 42 59:24.2 Normal -90.94 35.03 32|42|922337057246481159 [...]\n", " 6 22 Nadeem Asghar 87 59:25.0 Normal -91.14 34.96 22|87|922337057246481080 [...]\n", " 7 16 Tom McCuch 12 59:25.3 Normal -90.7 41.62 16|12|922337057246481053 [...]\n", " 8 21 Jeff Markham 109 59:25.7 Normal -94.35 38.33 21|109|922337057246481009 [...]\n" ] } ], "source": [ "name_not_contains_i = [truck_events[0]] + [\n", " row for row in truck_events if not re.search(\"i\", row[1])\n", "]\n", "pprint(name_not_contains_i[:10])" ] }, { "cell_type": "code", "execution_count": 18, "id": "5991bd1c-78d9-49c9-8dbd-26df4b46f377", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventKey [...]\n", " 0 14 Adis Cesir 25 59:21.4 Normal -94.58 37.03 14|25|9223370572464814373 [...]\n", " 1 18 Grant Liu 16 59:21.7 Normal -89.66 39.78 18|16|9223370572464814089 [...]\n", " 2 11 Jamie Engesser 74 59:21.7 Normal -90.2 38.65 11|74|9223370572464814123 [...]\n", " 3 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]\n", " 4 11 Jamie Engesser 74 59:22.5 Normal -89.74 39.1 11|74|9223370572464813355 [...]\n", " 5 20 Chris Harris 41 59:22.5 Normal -93.36 41.69 20|41|9223370572464813344 [...]\n", " 6 32 Ryan Templeton 42 59:22.5 Normal -90.37 35.21 32|42|9223370572464813296 [...]\n", " 7 17 Eric Mizell 15 59:23.2 Normal -90.55 38.81 17|15|9223370572464812585 [...]\n", " 8 14 Adis Cesir 25 59:23.3 Normal -94.31 37.31 14|25|9223370572464812526 [...]\n" ] } ], "source": [ "name_not_begins_with_MN = [truck_events[0]] + [\n", " row for row in truck_events[1:] if not re.search(\"^[MN]\", row[1])\n", "]\n", "pprint(name_not_begins_with_MN[:10])" ] }, { "cell_type": "markdown", "id": "1be24a39-3f20-40bd-bee2-e73e270cc009", "metadata": {}, "source": [ "## Borrado de registros duplicados" ] }, { "cell_type": "code", "execution_count": 19, "id": "2769ce41-1a71-4659-bef3-ff922b09fc72", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude event [...]\n", " 0 15 Rohit Bakshi 92 36:19.8 Normal -91.14 34.96 15|92|9223370572419395 [...]\n", " 1 31 Rommel Garcia 86 37:55.2 Normal -91.93 34.81 31|86|9223370572419300 [...]\n", " 2 14 Adis Cesir 56 36:35.7 Normal -90.29 40.96 14|56|9223370572419380 [...]\n", " 3 29 Teddy Choi 68 36:18.5 Normal -95.56 35.97 29|68|9223370572419397 [...]\n", " 4 10 George Vetticaden 39 12:03.0 Normal -93.69 37.16 10|39|9223370571956452 [...]\n", " 5 27 Mark Lochbihler 60 36:51.2 Normal -91.19 38.83 27|60|9223370572419364 [...]\n", " 6 11 Jamie Engesser 39 38:04.8 Normal -90.2 41.59 11|39|9223370572419290 [...]\n", " 7 14 Adis Cesir 18 11:46.7 Normal -91.47 41.74 14|18|9223370571956469 [...]\n", " 8 30 Dan Rice 95 09:32.7 Normal -91.59 41.7 30|95|9223370571956603 [...]\n" ] } ], "source": [ "truck_events_as_strings = [\",\".join(row) for row in truck_events[1:]]\n", "unique_truck_events = list(set(truck_events_as_strings))\n", "unique_truck_events = [row.split(\",\") for row in unique_truck_events]\n", "unique_truck_events = [truck_events[0]] + unique_truck_events\n", "pprint(unique_truck_events[:10])" ] }, { "cell_type": "markdown", "id": "a3ee829c-fcd6-4992-9d92-24dd94756db5", "metadata": {}, "source": [ "## Valores únicos por campo" ] }, { "cell_type": "code", "execution_count": 20, "id": "3fdd8408-84de-4e2c-97c9-f5ab879bd795", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Adam Diaz',\n", " 'Adis Cesir',\n", " 'Ajay Singh',\n", " 'Chris Harris',\n", " 'Dan Rice',\n", " 'Don Hilborn',\n", " 'Eric Mizell',\n", " 'George Vetticaden',\n", " 'Grant Liu',\n", " 'Jamie Engesser',\n", " 'Jean-Philippe Player',\n", " 'Jeff Markham',\n", " 'Joe Niemiec',\n", " 'Mark Lochbihler',\n", " 'Michael Aube',\n", " 'Nadeem Asghar',\n", " 'Olivier Renault',\n", " 'Paul Codding',\n", " 'Rohit Bakshi',\n", " 'Rommel Garcia',\n", " 'Ryan Templeton',\n", " 'Teddy Choi',\n", " 'Tom McCuch']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "driverNames = sorted({row[1] for row in truck_events[1:]})\n", "driverNames" ] }, { "cell_type": "markdown", "id": "e5e68753-537c-4b42-8294-8c39dfd5cca7", "metadata": {}, "source": [ "## Ordenamiento de registros con base en un campo" ] }, { "cell_type": "code", "execution_count": 21, "id": "8a273b1d-c4df-4edb-98d3-7c0e606688b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId driverName truckId eventTime eventType longitude latitude eventKey Cor [...]\n", " 0 23 Adam Diaz 68 59:22.4 Normal -89.91 40.86 23|68|9223370572464813450 [...]\n", " 1 23 Adam Diaz 68 59:26.6 Normal -91.32 41.71 23|68|9223370572464809216 [...]\n", " 2 23 Adam Diaz 68 59:27.4 Normal -91.47 41.74 23|68|9223370572464808375 [...]\n", " 3 23 Adam Diaz 68 59:28.4 Normal -91.63 41.72 23|68|9223370572464807444 [...]\n", " 4 23 Adam Diaz 68 59:29.9 Normal -91.78 42.23 23|68|9223370572464805905 [...]\n", " 5 23 Adam Diaz 68 59:30.8 Normal -91.63 41.72 23|68|9223370572464804995 [...]\n", " 6 23 Adam Diaz 68 59:37.0 Normal -89.91 40.86 23|68|9223370572464798855 [...]\n", " 7 23 Adam Diaz 68 59:39.6 Normal -89.91 40.86 23|68|9223370572464796242 [...]\n", " 8 23 Adam Diaz 68 59:40.4 Normal -90.29 40.96 23|68|9223370572464795396 [...]\n" ] } ], "source": [ "from operator import itemgetter\n", "\n", "sorted_truck_events = [truck_events[0]] + sorted(\n", " [row for row in truck_events[1:]], key=itemgetter(1)\n", ")\n", "pprint(sorted_truck_events[:10])" ] }, { "cell_type": "markdown", "id": "114b1930-7bff-4a6d-8bbf-7478f71b4666", "metadata": {}, "source": [ "## Obtención de un subconjunto de columnas" ] }, { "cell_type": "code", "execution_count": 22, "id": "a1c2f5e0-12d8-4800-8458-7974945e9394", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId eventTime eventType \n", " 0 14 59:21.4 Normal \n", " 1 18 59:21.7 Normal \n", " 2 27 59:21.7 Normal \n", " 3 11 59:21.7 Normal \n", " 4 22 59:21.7 Normal \n", " 5 22 59:22.3 Normal \n", " 6 23 59:22.4 Normal \n", " 7 11 59:22.5 Normal \n", " 8 20 59:22.5 Normal \n" ] } ], "source": [ "#\n", "# Seleccione las columnas \"driverId\", \"eventTime\", \"eventType\"\n", "# de la variable truck_events_subset\n", "#\n", "\n", "column_indexes = [\n", " i_col\n", " for i_col, colname in enumerate(truck_events_subset[0])\n", " if colname in [\"driverId\", \"eventTime\", \"eventType\"]\n", "]\n", "\n", "specific_columns = [\n", " [col for i_col, col in enumerate(row) if i_col in column_indexes]\n", " for row in truck_events_subset\n", "]\n", "\n", "pprint(specific_columns)" ] }, { "cell_type": "markdown", "id": "40ba6702-53cc-4757-ae4b-adfca9f0d543", "metadata": {}, "source": [ "## Escritura al disco como una tabla" ] }, { "cell_type": "code", "execution_count": 23, "id": "2213d5d1-28aa-4281-b3da-594d600c1b15", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'driverId,eventTime,eventType\\n14,59:21.4,Normal\\n18,59:21.7,Normal\\n27,59:21.7,Normal\\n11,59:21.7,Normal\\n22,59:21.7,Normal\\n22,59:22.3,Normal\\n23,59:22.4,Normal\\n11,59:22.5,Normal\\n20,59:22.5,Normal'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "specific_columns = [\",\".join(row) for row in specific_columns]\n", "specific_columns = \"\\n\".join(specific_columns)\n", "specific_columns" ] }, { "cell_type": "code", "execution_count": 24, "id": "e0a76aad-bc47-4c66-a787-851b3f5b2be8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "driverId,eventTime,eventType\n", "14,59:21.4,Normal\n", "18,59:21.7,Normal\n", "27,59:21.7,Normal\n", "11,59:21.7,Normal\n", "22,59:21.7,Normal\n", "22,59:22.3,Normal\n", "23,59:22.4,Normal\n", "11,59:22.5,Normal\n", "20,59:22.5,Normal\n" ] } ], "source": [ "with open(\"/tmp/specific_columns.csv\", \"w\") as file:\n", " print(specific_columns, file=file)\n", "\n", "!cat /tmp/specific_columns.csv" ] }, { "cell_type": "code", "execution_count": 25, "id": "e6aadc47-d04c-4308-81a3-7739d61e3c30", "metadata": {}, "outputs": [], "source": [ "!rm /tmp/*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.7 ('.venv': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.7" }, "vscode": { "interpreter": { "hash": "99683f87c1354598af39114791cc78b29c5617e49c4bfa4c5fb7492534eeeb87" } } }, "nbformat": 4, "nbformat_minor": 5 }