{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fe2f6570-5bd1-4972-8e18-546dee1a5972",
   "metadata": {
    "tags": []
   },
   "source": [
    "Procesamiento básico de datos usando Python --- 12:22 min\n",
    "===\n",
    "\n",
    "* 12:22 min | Última modificación: Octubre 12, 2021 "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2ae3369-6149-4c94-a8b0-93673cba17b8",
   "metadata": {},
   "source": [
    "En este tutorial se explica como realizar el procedimiento básico de datos usando Python."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd9c8f4b-b634-4330-98ba-71ee0a6d184f",
   "metadata": {},
   "source": [
    "Al finalizar este tutorial, usted estará en capacidad de:\n",
    "\n",
    "* Descargar archivos de datos desde internet y cargarlos como lista.\n",
    "\n",
    "* Usar la función groupby de la librería itertools.\n",
    "\n",
    "* Unir dos conjuntos de datos usando un campo clave.\n",
    "\n",
    "* Ordenar un conjunto de datos.\n",
    "\n",
    "* Búscar los registros que contienen el valor máximo o mínimo de un campo.\n",
    "\n",
    "* Escribir al disco duro los resultados."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce170a89-7d35-4a69-a01e-b506e647cf7e",
   "metadata": {},
   "source": [
    "## Descarga de datos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "375edeac-df4c-4ea9-8755-32976ec33f6b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/tmp/drivers.csv\n",
      "/tmp/timesheet.csv\n"
     ]
    }
   ],
   "source": [
    "url_drivers = \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/drivers.csv\"\n",
    "!wget --quiet {url_drivers} -P /tmp/\n",
    "\n",
    "url_timesheet = \"https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/timesheet.csv\"\n",
    "!wget --quiet {url_timesheet} -P /tmp/\n",
    "\n",
    "!ls -1 /tmp/*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91baa24c-b461-41cf-890e-488c52968f02",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Impresion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7c171b93-eeae-439b-a2b5-b6fa47363013",
   "metadata": {},
   "outputs": [],
   "source": [
    "def pprint(data):\n",
    "    #\n",
    "    def get_format_string(data):\n",
    "        format_string = []\n",
    "        for i_col in range(len(data[0])):\n",
    "            lengths = [len(str(row[i_col])) for row in data]\n",
    "            max_length = max(lengths)\n",
    "            #\n",
    "            # Por ejemplo: \"{:>10s}\"\n",
    "            #\n",
    "            format_string.append(\"{:>\" + str(max_length) + \"s}\")\n",
    "        return format_string\n",
    "\n",
    "    #\n",
    "    def print_data(format_string, data):\n",
    "        format_string = get_format_string(data)\n",
    "        for index, row in enumerate(data):\n",
    "            text = \"    \" if index == 0 else \"{:2d}  \".format(index - 1)\n",
    "            for fmt, value in zip(format_string, row):\n",
    "                text += fmt.format(str(value)) + \" \"\n",
    "            if len(text) >= 106:\n",
    "                text = text[:100] + \" [...]\"\n",
    "            print(text)\n",
    "\n",
    "    #\n",
    "    format_string = get_format_string(data)\n",
    "    print_data(format_string, data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5033b60b-d867-49f5-8df9-88cdf903e177",
   "metadata": {},
   "source": [
    "## Creación de la tabla drivers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6c392f66-9346-4779-9490-428e0d05da40",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId              name \n",
      " 0        10 George Vetticaden \n",
      " 1        11    Jamie Engesser \n",
      " 2        12       Paul Coddin \n",
      " 3        13       Joe Niemiec \n",
      " 4        14        Adis Cesir \n",
      " 5        15      Rohit Bakshi \n",
      " 6        16        Tom McCuch \n",
      " 7        17       Eric Mizell \n",
      " 8        18         Grant Liu \n"
     ]
    }
   ],
   "source": [
    "with open(\"/tmp/drivers.csv\", \"r\") as file:\n",
    "    drivers = file.readlines()\n",
    "\n",
    "drivers = [row.replace(\"\\n\", \"\") for row in drivers]\n",
    "drivers = [row.split(\",\") for row in drivers]\n",
    "drivers = [row[:2] for row in drivers]\n",
    "pprint(drivers[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c636746e-91c8-4d37-8127-408b7b6bbab4",
   "metadata": {},
   "source": [
    "## Creación de la tabla timesheet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f225cd42-b0fc-4046-996d-26b9ad614773",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId week hours-logged miles-logged \n",
      " 0        10    1           70         3300 \n",
      " 1        10    2           70         3300 \n",
      " 2        10    3           60         2800 \n",
      " 3        10    4           70         3100 \n",
      " 4        10    5           70         3200 \n",
      " 5        10    6           70         3300 \n",
      " 6        10    7           70         3000 \n",
      " 7        10    8           70         3300 \n",
      " 8        10    9           70         3200 \n"
     ]
    }
   ],
   "source": [
    "with open(\"/tmp/timesheet.csv\", \"r\") as file:\n",
    "    timesheet = file.readlines()\n",
    "\n",
    "timesheet = [row.replace(\"\\n\", \"\") for row in timesheet]\n",
    "timesheet = [row.split(\",\") for row in timesheet]\n",
    "pprint(timesheet[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1e56dc0a-5a22-4fc9-a1f8-3c181fe39e70",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId week hours-logged miles-logged \n",
      " 0        10    1           70         3300 \n",
      " 1        10    2           70         3300 \n",
      " 2        10    3           60         2800 \n",
      " 3        10    4           70         3100 \n",
      " 4        10    5           70         3200 \n",
      " 5        10    6           70         3300 \n",
      " 6        10    7           70         3000 \n",
      " 7        10    8           70         3300 \n",
      " 8        10    9           70         3200 \n"
     ]
    }
   ],
   "source": [
    "#\n",
    "# Typecast de datos\n",
    "#\n",
    "timesheet = [\n",
    "    [int(field) if i_row > 0 else field for field in row]\n",
    "    for i_row, row in enumerate(timesheet)\n",
    "]\n",
    "pprint(timesheet[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0b19066-bc57-4359-89d1-a7bb33d3b3c9",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Cantidad de horas y millas de cada conductor por año"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "d660e743-4da7-4eca-aa9f-14276b12260c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10\n",
      "    [10, 1, 70, 3300]\n",
      "    [10, 2, 70, 3300]\n",
      "    [10, 3, 60, 2800]\n",
      "    [10, 4, 70, 3100]\n",
      "    [10, 5, 70, 3200]\n",
      "    ...\n",
      "11\n",
      "    [11, 1, 50, 3000]\n",
      "    [11, 2, 83, 4000]\n",
      "    [11, 3, 80, 4000]\n",
      "    [11, 4, 85, 4000]\n",
      "    [11, 5, 90, 4100]\n",
      "    ...\n",
      "12\n",
      "    [12, 1, 49, 2783]\n",
      "    [12, 2, 50, 2505]\n",
      "    [12, 3, 51, 2577]\n",
      "    [12, 4, 54, 2743]\n",
      "    [12, 5, 47, 2791]\n",
      "    ...\n",
      "13\n",
      "    [13, 1, 49, 2643]\n",
      "    [13, 2, 56, 2553]\n",
      "    [13, 3, 60, 2539]\n",
      "    [13, 4, 55, 2553]\n",
      "    [13, 5, 45, 2762]\n",
      "    ...\n",
      "...\n"
     ]
    }
   ],
   "source": [
    "import itertools\n",
    "from operator import itemgetter\n",
    "\n",
    "for i_key, (key, group) in enumerate(\n",
    "    itertools.groupby(\n",
    "        timesheet[1:],\n",
    "        itemgetter(0),\n",
    "    )\n",
    "):\n",
    "    print(key)\n",
    "\n",
    "    for i_grp, grp in enumerate(group):\n",
    "        print(\"   \", grp)\n",
    "        if i_grp > 3:\n",
    "            print(\"    ...\")\n",
    "            break\n",
    "\n",
    "    if i_key > 2:\n",
    "        print(\"...\")\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3cad0dcb-aee2-4674-bcc7-d3ecfa866a97",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId hours-logged miles-logged \n",
      " 0        10         3232       147150 \n",
      " 1        11         3642       179300 \n",
      " 2        12         2639       135962 \n",
      " 3        13         2727       134126 \n",
      " 4        14         2781       136624 \n",
      " 5        15         2734       138750 \n",
      " 6        16         2746       137205 \n",
      " 7        17         2701       135992 \n",
      " 8        18         2654       137834 \n",
      " 9        19         2738       137968 \n",
      "10        20         2644       134564 \n",
      "11        21         2751       138719 \n",
      "12        22         2733       137550 \n",
      "13        23         2750       137980 \n",
      "14        24         2647       134461 \n",
      "15        25         2723       139180 \n",
      "16        26         2730       137530 \n",
      "17        27         2771       137922 \n",
      "18        28         2723       137469 \n",
      "19        29         2760       138255 \n",
      "20        30         2773       137473 \n",
      "21        31         2704       137057 \n",
      "22        32         2736       137422 \n",
      "23        33         2759       139285 \n",
      "24        34         2811       137728 \n",
      "25        35         2728       138727 \n",
      "26        36         2795       138025 \n",
      "27        37         2694       137223 \n",
      "28        38         2760       137464 \n",
      "29        39         2745       138788 \n",
      "30        40         2700       136931 \n",
      "31        41         2723       138407 \n",
      "32        42         2697       136673 \n",
      "33        43         2750       136993 \n"
     ]
    }
   ],
   "source": [
    "timesheet_grouped_by_driverId = {\n",
    "    driverId: list(group)\n",
    "    for driverId, group in itertools.groupby(\n",
    "        timesheet[1:],\n",
    "        key=itemgetter(0),\n",
    "    )\n",
    "}\n",
    "\n",
    "sum_timesheet = [\n",
    "    [\n",
    "        driverId,\n",
    "        sum([row[2] for row in timesheet_grouped_by_driverId[driverId]]),\n",
    "        sum([row[3] for row in timesheet_grouped_by_driverId[driverId]]),\n",
    "    ]\n",
    "    for driverId in timesheet_grouped_by_driverId.keys()\n",
    "]\n",
    "\n",
    "sum_timesheet = [[\"driverId\", \"hours-logged\", \"miles-logged\"]] + sum_timesheet\n",
    "\n",
    "pprint(sum_timesheet)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bdb723b8-533c-4eff-b64f-bfbf2debe4db",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Unión de las tablas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "37e2fb01-9847-4f65-9f32-df3a29596875",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId                name hours-logged miles-logged \n",
      " 0        10   George Vetticaden         3232       147150 \n",
      " 1        11      Jamie Engesser         3642       179300 \n",
      " 2        12         Paul Coddin         2639       135962 \n",
      " 3        13         Joe Niemiec         2727       134126 \n",
      " 4        14          Adis Cesir         2781       136624 \n",
      " 5        15        Rohit Bakshi         2734       138750 \n",
      " 6        16          Tom McCuch         2746       137205 \n",
      " 7        17         Eric Mizell         2701       135992 \n",
      " 8        18           Grant Liu         2654       137834 \n",
      " 9        19          Ajay Singh         2738       137968 \n",
      "10        20        Chris Harris         2644       134564 \n",
      "11        21        Jeff Markham         2751       138719 \n",
      "12        22       Nadeem Asghar         2733       137550 \n",
      "13        23           Adam Diaz         2750       137980 \n",
      "14        24         Don Hilborn         2647       134461 \n",
      "15        25 Jean-Philippe Playe         2723       139180 \n",
      "16        26        Michael Aube         2730       137530 \n",
      "17        27     Mark Lochbihler         2771       137922 \n",
      "18        28     Olivier Renault         2723       137469 \n",
      "19        29          Teddy Choi         2760       138255 \n",
      "20        30            Dan Rice         2773       137473 \n",
      "21        31       Rommel Garcia         2704       137057 \n",
      "22        32      Ryan Templeton         2736       137422 \n",
      "23        33   Sridhara Sabbella         2759       139285 \n",
      "24        34        Frank Romano         2811       137728 \n",
      "25        35         Emil Siemes         2728       138727 \n",
      "26        36       Andrew Grande         2795       138025 \n",
      "27        37           Wes Floyd         2694       137223 \n",
      "28        38          Scott Shaw         2760       137464 \n",
      "29        39        David Kaiser         2745       138788 \n",
      "30        40    Nicolas Maillard         2700       136931 \n",
      "31        41       Greg Phillips         2723       138407 \n",
      "32        42     Randy Gelhausen         2697       136673 \n",
      "33        43         Dave Patton         2750       136993 \n"
     ]
    }
   ],
   "source": [
    "summary = [\n",
    "    row_drivers + row_timesheet[1:]\n",
    "    for row_drivers in drivers[1:]\n",
    "    for row_timesheet in sum_timesheet[1:]\n",
    "    if row_drivers[0] == str(row_timesheet[0])\n",
    "]\n",
    "\n",
    "summary = [[\"driverId\", \"name\", \"hours-logged\", \"miles-logged\"]] + summary\n",
    "pprint(summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9b2f308-14bd-4c4b-a801-d9b78fe0f33f",
   "metadata": {},
   "source": [
    "## Ordenamiento de la tabla"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "190760a5-228b-433c-b791-b486dd678a76",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    driverId                name hours-logged miles-logged \n",
      " 0        12         Paul Coddin         2639       135962 \n",
      " 1        20        Chris Harris         2644       134564 \n",
      " 2        24         Don Hilborn         2647       134461 \n",
      " 3        18           Grant Liu         2654       137834 \n",
      " 4        37           Wes Floyd         2694       137223 \n",
      " 5        42     Randy Gelhausen         2697       136673 \n",
      " 6        40    Nicolas Maillard         2700       136931 \n",
      " 7        17         Eric Mizell         2701       135992 \n",
      " 8        31       Rommel Garcia         2704       137057 \n",
      " 9        25 Jean-Philippe Playe         2723       139180 \n",
      "10        28     Olivier Renault         2723       137469 \n",
      "11        41       Greg Phillips         2723       138407 \n",
      "12        13         Joe Niemiec         2727       134126 \n",
      "13        35         Emil Siemes         2728       138727 \n",
      "14        26        Michael Aube         2730       137530 \n",
      "15        22       Nadeem Asghar         2733       137550 \n",
      "16        15        Rohit Bakshi         2734       138750 \n",
      "17        32      Ryan Templeton         2736       137422 \n",
      "18        19          Ajay Singh         2738       137968 \n",
      "19        39        David Kaiser         2745       138788 \n",
      "20        16          Tom McCuch         2746       137205 \n",
      "21        23           Adam Diaz         2750       137980 \n",
      "22        43         Dave Patton         2750       136993 \n",
      "23        21        Jeff Markham         2751       138719 \n",
      "24        33   Sridhara Sabbella         2759       139285 \n",
      "25        29          Teddy Choi         2760       138255 \n",
      "26        38          Scott Shaw         2760       137464 \n",
      "27        27     Mark Lochbihler         2771       137922 \n",
      "28        30            Dan Rice         2773       137473 \n",
      "29        14          Adis Cesir         2781       136624 \n",
      "30        36       Andrew Grande         2795       138025 \n",
      "31        34        Frank Romano         2811       137728 \n",
      "32        10   George Vetticaden         3232       147150 \n",
      "33        11      Jamie Engesser         3642       179300 \n"
     ]
    }
   ],
   "source": [
    "from operator import itemgetter\n",
    "\n",
    "sorted_summary = [summary[0]] + sorted([row for row in summary[1:]], key=itemgetter(2))\n",
    "pprint(sorted_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7227110-ee78-4509-9913-4f7728cb9f09",
   "metadata": {},
   "source": [
    "## Búsqueda del máximo o el mínimo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9baa5fee-cb4f-4a4f-b87f-9e1f1183499c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['11', 'Jamie Engesser', 3642, 179300]]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[\n",
    "    row\n",
    "    for row in sorted_summary[1:]\n",
    "    if row[2] == max(aux_row[2] for aux_row in sorted_summary[1:])\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22a980ee-7eda-4dba-9dff-7b3234c1db38",
   "metadata": {},
   "source": [
    "## Almacenamiento de los resultados"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "da05c779-1438-48b9-920c-40cffd5238da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "driverId,name,hours-logged,miles-logged\n",
      "10,George Vetticaden,3232,147150\n",
      "11,Jamie Engesser,3642,179300\n",
      "12,Paul Coddin,2639,135962\n",
      "13,Joe Niemiec,2727,134126\n",
      "14,Adis Cesir,2781,136624\n",
      "15,Rohit Bakshi,2734,138750\n",
      "16,Tom McCuch,2746,137205\n",
      "17,Eric Mizell,2701,135992\n",
      "18,Grant Liu,2654,137834\n",
      "19,Ajay Singh,2738,137968\n",
      "20,Chris Harris,2644,134564\n",
      "21,Jeff Markham,2751,138719\n",
      "22,Nadeem Asghar,2733,137550\n",
      "23,Adam Diaz,2750,137980\n",
      "24,Don Hilborn,2647,134461\n",
      "25,Jean-Philippe Playe,2723,139180\n",
      "26,Michael Aube,2730,137530\n",
      "27,Mark Lochbihler,2771,137922\n",
      "28,Olivier Renault,2723,137469\n",
      "29,Teddy Choi,2760,138255\n",
      "30,Dan Rice,2773,137473\n",
      "31,Rommel Garcia,2704,137057\n",
      "32,Ryan Templeton,2736,137422\n",
      "33,Sridhara Sabbella,2759,139285\n",
      "34,Frank Romano,2811,137728\n",
      "35,Emil Siemes,2728,138727\n",
      "36,Andrew Grande,2795,138025\n",
      "37,Wes Floyd,2694,137223\n",
      "38,Scott Shaw,2760,137464\n",
      "39,David Kaiser,2745,138788\n",
      "40,Nicolas Maillard,2700,136931\n",
      "41,Greg Phillips,2723,138407\n",
      "42,Randy Gelhausen,2697,136673\n",
      "43,Dave Patton,2750,136993\n"
     ]
    }
   ],
   "source": [
    "summary = [[str(field) for field in row] for row in summary]\n",
    "summary = [\",\".join(row) for row in summary]\n",
    "summary = \"\\n\".join(summary)\n",
    "\n",
    "with open(\"/tmp/summary.csv\", \"w\") as file:\n",
    "    print(summary, file=file)\n",
    "\n",
    "!cat /tmp/summary.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "43399ef5-5dba-4470-97a7-01c5b7b0d286",
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm /tmp/*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}