{ "cells": [ { "cell_type": "markdown", "id": "fe2f6570-5bd1-4972-8e18-546dee1a5972", "metadata": { "tags": [] }, "source": [ "Procesamiento básico de datos usando Python --- 12:22 min\n", "===\n", "\n", "* 12:22 min | Última modificación: Octubre 12, 2021 " ] }, { "cell_type": "markdown", "id": "a2ae3369-6149-4c94-a8b0-93673cba17b8", "metadata": {}, "source": [ "En este tutorial se explica como realizar el procedimiento básico de datos usando Python." ] }, { "cell_type": "markdown", "id": "bd9c8f4b-b634-4330-98ba-71ee0a6d184f", "metadata": {}, "source": [ "Al finalizar este tutorial, usted estará en capacidad de:\n", "\n", "* Descargar archivos de datos desde internet y cargarlos como lista.\n", "\n", "* Usar la función groupby de la librería itertools.\n", "\n", "* Unir dos conjuntos de datos usando un campo clave.\n", "\n", "* Ordenar un conjunto de datos.\n", "\n", "* Búscar los registros que contienen el valor máximo o mínimo de un campo.\n", "\n", "* Escribir al disco duro los resultados." ] }, { "cell_type": "markdown", "id": "ce170a89-7d35-4a69-a01e-b506e647cf7e", "metadata": {}, "source": [ "## Descarga de datos" ] }, { "cell_type": "code", "execution_count": 1, "id": "375edeac-df4c-4ea9-8755-32976ec33f6b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/tmp/drivers.csv\n", "/tmp/timesheet.csv\n" ] } ], "source": [ "url_drivers = \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv\"\n", "!wget --quiet {url_drivers} -P /tmp/\n", "\n", "url_timesheet = \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv\"\n", "!wget --quiet {url_timesheet} -P /tmp/\n", "\n", "!ls -1 /tmp/*" ] }, { "cell_type": "markdown", "id": "91baa24c-b461-41cf-890e-488c52968f02", "metadata": { "tags": [] }, "source": [ "## Impresion" ] }, { "cell_type": "code", "execution_count": 2, "id": "7c171b93-eeae-439b-a2b5-b6fa47363013", "metadata": {}, "outputs": [], "source": [ "def pprint(data):\n", " #\n", " def get_format_string(data):\n", " format_string = []\n", " for i_col in range(len(data[0])):\n", " lengths = [len(str(row[i_col])) for row in data]\n", " max_length = max(lengths)\n", " #\n", " # Por ejemplo: \"{:>10s}\"\n", " #\n", " format_string.append(\"{:>\" + str(max_length) + \"s}\")\n", " return format_string\n", "\n", " #\n", " def print_data(format_string, data):\n", " format_string = get_format_string(data)\n", " for index, row in enumerate(data):\n", " text = \" \" if index == 0 else \"{:2d} \".format(index - 1)\n", " for fmt, value in zip(format_string, row):\n", " text += fmt.format(str(value)) + \" \"\n", " if len(text) >= 106:\n", " text = text[:100] + \" [...]\"\n", " print(text)\n", "\n", " #\n", " format_string = get_format_string(data)\n", " print_data(format_string, data)" ] }, { "cell_type": "markdown", "id": "5033b60b-d867-49f5-8df9-88cdf903e177", "metadata": {}, "source": [ "## Creación de la tabla drivers" ] }, { "cell_type": "code", "execution_count": 3, "id": "6c392f66-9346-4779-9490-428e0d05da40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId name \n", " 0 10 George Vetticaden \n", " 1 11 Jamie Engesser \n", " 2 12 Paul Coddin \n", " 3 13 Joe Niemiec \n", " 4 14 Adis Cesir \n", " 5 15 Rohit Bakshi \n", " 6 16 Tom McCuch \n", " 7 17 Eric Mizell \n", " 8 18 Grant Liu \n" ] } ], "source": [ "with open(\"/tmp/drivers.csv\", \"r\") as file:\n", " drivers = file.readlines()\n", "\n", "drivers = [row.replace(\"\\n\", \"\") for row in drivers]\n", "drivers = [row.split(\",\") for row in drivers]\n", "drivers = [row[:2] for row in drivers]\n", "pprint(drivers[0:10])" ] }, { "cell_type": "markdown", "id": "c636746e-91c8-4d37-8127-408b7b6bbab4", "metadata": {}, "source": [ "## Creación de la tabla timesheet" ] }, { "cell_type": "code", "execution_count": 4, "id": "f225cd42-b0fc-4046-996d-26b9ad614773", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId week hours-logged miles-logged \n", " 0 10 1 70 3300 \n", " 1 10 2 70 3300 \n", " 2 10 3 60 2800 \n", " 3 10 4 70 3100 \n", " 4 10 5 70 3200 \n", " 5 10 6 70 3300 \n", " 6 10 7 70 3000 \n", " 7 10 8 70 3300 \n", " 8 10 9 70 3200 \n" ] } ], "source": [ "with open(\"/tmp/timesheet.csv\", \"r\") as file:\n", " timesheet = file.readlines()\n", "\n", "timesheet = [row.replace(\"\\n\", \"\") for row in timesheet]\n", "timesheet = [row.split(\",\") for row in timesheet]\n", "pprint(timesheet[:10])" ] }, { "cell_type": "code", "execution_count": 5, "id": "1e56dc0a-5a22-4fc9-a1f8-3c181fe39e70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId week hours-logged miles-logged \n", " 0 10 1 70 3300 \n", " 1 10 2 70 3300 \n", " 2 10 3 60 2800 \n", " 3 10 4 70 3100 \n", " 4 10 5 70 3200 \n", " 5 10 6 70 3300 \n", " 6 10 7 70 3000 \n", " 7 10 8 70 3300 \n", " 8 10 9 70 3200 \n" ] } ], "source": [ "#\n", "# Typecast de datos\n", "#\n", "timesheet = [\n", " [int(field) if i_row > 0 else field for field in row]\n", " for i_row, row in enumerate(timesheet)\n", "]\n", "pprint(timesheet[:10])" ] }, { "cell_type": "markdown", "id": "d0b19066-bc57-4359-89d1-a7bb33d3b3c9", "metadata": { "tags": [] }, "source": [ "## Cantidad de horas y millas de cada conductor por año" ] }, { "cell_type": "code", "execution_count": 6, "id": "d660e743-4da7-4eca-aa9f-14276b12260c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", " [10, 1, 70, 3300]\n", " [10, 2, 70, 3300]\n", " [10, 3, 60, 2800]\n", " [10, 4, 70, 3100]\n", " [10, 5, 70, 3200]\n", " ...\n", "11\n", " [11, 1, 50, 3000]\n", " [11, 2, 83, 4000]\n", " [11, 3, 80, 4000]\n", " [11, 4, 85, 4000]\n", " [11, 5, 90, 4100]\n", " ...\n", "12\n", " [12, 1, 49, 2783]\n", " [12, 2, 50, 2505]\n", " [12, 3, 51, 2577]\n", " [12, 4, 54, 2743]\n", " [12, 5, 47, 2791]\n", " ...\n", "13\n", " [13, 1, 49, 2643]\n", " [13, 2, 56, 2553]\n", " [13, 3, 60, 2539]\n", " [13, 4, 55, 2553]\n", " [13, 5, 45, 2762]\n", " ...\n", "...\n" ] } ], "source": [ "import itertools\n", "from operator import itemgetter\n", "\n", "for i_key, (key, group) in enumerate(\n", " itertools.groupby(\n", " timesheet[1:],\n", " itemgetter(0),\n", " )\n", "):\n", " print(key)\n", "\n", " for i_grp, grp in enumerate(group):\n", " print(\" \", grp)\n", " if i_grp > 3:\n", " print(\" ...\")\n", " break\n", "\n", " if i_key > 2:\n", " print(\"...\")\n", " break" ] }, { "cell_type": "code", "execution_count": 7, "id": "3cad0dcb-aee2-4674-bcc7-d3ecfa866a97", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId hours-logged miles-logged \n", " 0 10 3232 147150 \n", " 1 11 3642 179300 \n", " 2 12 2639 135962 \n", " 3 13 2727 134126 \n", " 4 14 2781 136624 \n", " 5 15 2734 138750 \n", " 6 16 2746 137205 \n", " 7 17 2701 135992 \n", " 8 18 2654 137834 \n", " 9 19 2738 137968 \n", "10 20 2644 134564 \n", "11 21 2751 138719 \n", "12 22 2733 137550 \n", "13 23 2750 137980 \n", "14 24 2647 134461 \n", "15 25 2723 139180 \n", "16 26 2730 137530 \n", "17 27 2771 137922 \n", "18 28 2723 137469 \n", "19 29 2760 138255 \n", "20 30 2773 137473 \n", "21 31 2704 137057 \n", "22 32 2736 137422 \n", "23 33 2759 139285 \n", "24 34 2811 137728 \n", "25 35 2728 138727 \n", "26 36 2795 138025 \n", "27 37 2694 137223 \n", "28 38 2760 137464 \n", "29 39 2745 138788 \n", "30 40 2700 136931 \n", "31 41 2723 138407 \n", "32 42 2697 136673 \n", "33 43 2750 136993 \n" ] } ], "source": [ "timesheet_grouped_by_driverId = {\n", " driverId: list(group)\n", " for driverId, group in itertools.groupby(\n", " timesheet[1:],\n", " key=itemgetter(0),\n", " )\n", "}\n", "\n", "sum_timesheet = [\n", " [\n", " driverId,\n", " sum([row[2] for row in timesheet_grouped_by_driverId[driverId]]),\n", " sum([row[3] for row in timesheet_grouped_by_driverId[driverId]]),\n", " ]\n", " for driverId in timesheet_grouped_by_driverId.keys()\n", "]\n", "\n", "sum_timesheet = [[\"driverId\", \"hours-logged\", \"miles-logged\"]] + sum_timesheet\n", "\n", "pprint(sum_timesheet)" ] }, { "cell_type": "markdown", "id": "bdb723b8-533c-4eff-b64f-bfbf2debe4db", "metadata": { "tags": [] }, "source": [ "## Unión de las tablas" ] }, { "cell_type": "code", "execution_count": 8, "id": "37e2fb01-9847-4f65-9f32-df3a29596875", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId name hours-logged miles-logged \n", " 0 10 George Vetticaden 3232 147150 \n", " 1 11 Jamie Engesser 3642 179300 \n", " 2 12 Paul Coddin 2639 135962 \n", " 3 13 Joe Niemiec 2727 134126 \n", " 4 14 Adis Cesir 2781 136624 \n", " 5 15 Rohit Bakshi 2734 138750 \n", " 6 16 Tom McCuch 2746 137205 \n", " 7 17 Eric Mizell 2701 135992 \n", " 8 18 Grant Liu 2654 137834 \n", " 9 19 Ajay Singh 2738 137968 \n", "10 20 Chris Harris 2644 134564 \n", "11 21 Jeff Markham 2751 138719 \n", "12 22 Nadeem Asghar 2733 137550 \n", "13 23 Adam Diaz 2750 137980 \n", "14 24 Don Hilborn 2647 134461 \n", "15 25 Jean-Philippe Playe 2723 139180 \n", "16 26 Michael Aube 2730 137530 \n", "17 27 Mark Lochbihler 2771 137922 \n", "18 28 Olivier Renault 2723 137469 \n", "19 29 Teddy Choi 2760 138255 \n", "20 30 Dan Rice 2773 137473 \n", "21 31 Rommel Garcia 2704 137057 \n", "22 32 Ryan Templeton 2736 137422 \n", "23 33 Sridhara Sabbella 2759 139285 \n", "24 34 Frank Romano 2811 137728 \n", "25 35 Emil Siemes 2728 138727 \n", "26 36 Andrew Grande 2795 138025 \n", "27 37 Wes Floyd 2694 137223 \n", "28 38 Scott Shaw 2760 137464 \n", "29 39 David Kaiser 2745 138788 \n", "30 40 Nicolas Maillard 2700 136931 \n", "31 41 Greg Phillips 2723 138407 \n", "32 42 Randy Gelhausen 2697 136673 \n", "33 43 Dave Patton 2750 136993 \n" ] } ], "source": [ "summary = [\n", " row_drivers + row_timesheet[1:]\n", " for row_drivers in drivers[1:]\n", " for row_timesheet in sum_timesheet[1:]\n", " if row_drivers[0] == str(row_timesheet[0])\n", "]\n", "\n", "summary = [[\"driverId\", \"name\", \"hours-logged\", \"miles-logged\"]] + summary\n", "pprint(summary)" ] }, { "cell_type": "markdown", "id": "b9b2f308-14bd-4c4b-a801-d9b78fe0f33f", "metadata": {}, "source": [ "## Ordenamiento de la tabla" ] }, { "cell_type": "code", "execution_count": 9, "id": "190760a5-228b-433c-b791-b486dd678a76", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " driverId name hours-logged miles-logged \n", " 0 12 Paul Coddin 2639 135962 \n", " 1 20 Chris Harris 2644 134564 \n", " 2 24 Don Hilborn 2647 134461 \n", " 3 18 Grant Liu 2654 137834 \n", " 4 37 Wes Floyd 2694 137223 \n", " 5 42 Randy Gelhausen 2697 136673 \n", " 6 40 Nicolas Maillard 2700 136931 \n", " 7 17 Eric Mizell 2701 135992 \n", " 8 31 Rommel Garcia 2704 137057 \n", " 9 25 Jean-Philippe Playe 2723 139180 \n", "10 28 Olivier Renault 2723 137469 \n", "11 41 Greg Phillips 2723 138407 \n", "12 13 Joe Niemiec 2727 134126 \n", "13 35 Emil Siemes 2728 138727 \n", "14 26 Michael Aube 2730 137530 \n", "15 22 Nadeem Asghar 2733 137550 \n", "16 15 Rohit Bakshi 2734 138750 \n", "17 32 Ryan Templeton 2736 137422 \n", "18 19 Ajay Singh 2738 137968 \n", "19 39 David Kaiser 2745 138788 \n", "20 16 Tom McCuch 2746 137205 \n", "21 23 Adam Diaz 2750 137980 \n", "22 43 Dave Patton 2750 136993 \n", "23 21 Jeff Markham 2751 138719 \n", "24 33 Sridhara Sabbella 2759 139285 \n", "25 29 Teddy Choi 2760 138255 \n", "26 38 Scott Shaw 2760 137464 \n", "27 27 Mark Lochbihler 2771 137922 \n", "28 30 Dan Rice 2773 137473 \n", "29 14 Adis Cesir 2781 136624 \n", "30 36 Andrew Grande 2795 138025 \n", "31 34 Frank Romano 2811 137728 \n", "32 10 George Vetticaden 3232 147150 \n", "33 11 Jamie Engesser 3642 179300 \n" ] } ], "source": [ "from operator import itemgetter\n", "\n", "sorted_summary = [summary[0]] + sorted([row for row in summary[1:]], key=itemgetter(2))\n", "pprint(sorted_summary)" ] }, { "cell_type": "markdown", "id": "f7227110-ee78-4509-9913-4f7728cb9f09", "metadata": {}, "source": [ "## Búsqueda del máximo o el mínimo" ] }, { "cell_type": "code", "execution_count": 10, "id": "9baa5fee-cb4f-4a4f-b87f-9e1f1183499c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['11', 'Jamie Engesser', 3642, 179300]]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[\n", " row\n", " for row in sorted_summary[1:]\n", " if row[2] == max(aux_row[2] for aux_row in sorted_summary[1:])\n", "]" ] }, { "cell_type": "markdown", "id": "22a980ee-7eda-4dba-9dff-7b3234c1db38", "metadata": {}, "source": [ "## Almacenamiento de los resultados" ] }, { "cell_type": "code", "execution_count": 11, "id": "da05c779-1438-48b9-920c-40cffd5238da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "driverId,name,hours-logged,miles-logged\n", "10,George Vetticaden,3232,147150\n", "11,Jamie Engesser,3642,179300\n", "12,Paul Coddin,2639,135962\n", "13,Joe Niemiec,2727,134126\n", "14,Adis Cesir,2781,136624\n", "15,Rohit Bakshi,2734,138750\n", "16,Tom McCuch,2746,137205\n", "17,Eric Mizell,2701,135992\n", "18,Grant Liu,2654,137834\n", "19,Ajay Singh,2738,137968\n", "20,Chris Harris,2644,134564\n", "21,Jeff Markham,2751,138719\n", "22,Nadeem Asghar,2733,137550\n", "23,Adam Diaz,2750,137980\n", "24,Don Hilborn,2647,134461\n", "25,Jean-Philippe Playe,2723,139180\n", "26,Michael Aube,2730,137530\n", "27,Mark Lochbihler,2771,137922\n", "28,Olivier Renault,2723,137469\n", "29,Teddy Choi,2760,138255\n", "30,Dan Rice,2773,137473\n", "31,Rommel Garcia,2704,137057\n", "32,Ryan Templeton,2736,137422\n", "33,Sridhara Sabbella,2759,139285\n", "34,Frank Romano,2811,137728\n", "35,Emil Siemes,2728,138727\n", "36,Andrew Grande,2795,138025\n", "37,Wes Floyd,2694,137223\n", "38,Scott Shaw,2760,137464\n", "39,David Kaiser,2745,138788\n", "40,Nicolas Maillard,2700,136931\n", "41,Greg Phillips,2723,138407\n", "42,Randy Gelhausen,2697,136673\n", "43,Dave Patton,2750,136993\n" ] } ], "source": [ "summary = [[str(field) for field in row] for row in summary]\n", "summary = [\",\".join(row) for row in summary]\n", "summary = \"\\n\".join(summary)\n", "\n", "with open(\"/tmp/summary.csv\", \"w\") as file:\n", " print(summary, file=file)\n", "\n", "!cat /tmp/summary.csv" ] }, { "cell_type": "code", "execution_count": 12, "id": "43399ef5-5dba-4470-97a7-01c5b7b0d286", "metadata": {}, "outputs": [], "source": [ "!rm /tmp/*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 5 }