Fusión (merging/join) de datasets — 14:56 min

  • 14:56 min | Última modificación: Octubre 6, 2021

Realiza la unión de datasets de forma similar a como se hacen los joins en las bases de datos.

[1]:
import pandas as pd

pd.set_option("display.notebook_repr_html", False)

Fusión uno-a-uno sobre columnas comunes

El join se ejecuta sobre las columnas que tienen los mismos nombres en ambas tablas.

Creación y carga del primer dataframe.

[2]:
%%writefile /tmp/clients.csv
clientId,key,name
11,B,Ferris Q. Le
10,A,Omar Y. Fletcher
11,A,Mira N. Franklin
10,B,Buffy W. Vincent
Writing /tmp/clients.csv

Creación y carga del segundo dataframe.

[3]:
%%writefile /tmp/bonus.csv
clientId,key,bonus
10,A,138
10,B,227
11,A,279
11,B,160
Overwriting /tmp/bonus.csv

Merge usando como claves las columnas comunes

[4]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")

display(
    clients.sort_values(['clientId', 'key']),
    "",
    bonus.sort_values(['clientId', 'key']),
    "",
    pd.merge(
        clients,
        bonus,
        sort=True,
    )
)
   clientId key              name
1        10   A  Omar Y. Fletcher
3        10   B  Buffy W. Vincent
2        11   A  Mira N. Franklin
0        11   B      Ferris Q. Le
''
   clientId key  bonus
0        10   A    138
1        10   B    227
2        11   A    279
3        11   B    160
''
   clientId key              name  bonus
0        10   A  Omar Y. Fletcher    138
1        10   B  Buffy W. Vincent    227
2        11   A  Mira N. Franklin    279
3        11   B      Ferris Q. Le    160

Fusión uno-a-uno con registros incompletos

[5]:
%%writefile /tmp/clients.csv
clientId,name
10,Ferris Q. Le
11,Mira N. Franklin
12,Baker C. Hurst
Overwriting /tmp/clients.csv
[6]:
%%writefile /tmp/bonus.csv
clientId,bonus
10,279
11,160
20,169
21,263
Overwriting /tmp/bonus.csv
[7]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")

#
# Note que merge solo retorno los registros con
# información completa
#
display(
    clients,
    "",
    bonus,
    "",
    pd.merge(
        clients,
        bonus,
        on="clientId",
    )
)
   clientId              name
0        10      Ferris Q. Le
1        11  Mira N. Franklin
2        12    Baker C. Hurst
''
   clientId  bonus
0        10    279
1        11    160
2        20    169
3        21    263
''
   clientId              name  bonus
0        10      Ferris Q. Le    279
1        11  Mira N. Franklin    160

Fusión muchos-a-uno

[8]:
%%writefile /tmp/clients.csv
clientId,name
10,Ferris Q. Le
11,Mira N. Franklin
12,Baker C. Hurst
Overwriting /tmp/clients.csv
[9]:
%%writefile /tmp/sales.csv
clientId, month, sales
10, jan, 1239
10, feb, 387
11, jan, 454
11, mar, 495
11, sep, 145
12, may, 4959
12, dec, 493
12, oct, 4981
12, jan, 484
15, may, 394
15, sep, 585
Writing /tmp/sales.csv
[10]:
clients = pd.read_csv("/tmp/clients.csv")
sales = pd.read_csv("/tmp/sales.csv")

#
# Sales tiene valores del campo clientId
# duplicados. Note que se copia la información
# de la tabla de la derecha
#
display(
    clients,
    "",
    sales,
    "",
    pd.merge(
        sales,
        clients,
        sort=True,
    )
)
   clientId              name
0        10      Ferris Q. Le
1        11  Mira N. Franklin
2        12    Baker C. Hurst
''
    clientId  month   sales
0         10    jan    1239
1         10    feb     387
2         11    jan     454
3         11    mar     495
4         11    sep     145
5         12    may    4959
6         12    dec     493
7         12    oct    4981
8         12    jan     484
9         15    may     394
10        15    sep     585
''
   clientId  month   sales              name
0        10    jan    1239      Ferris Q. Le
1        10    feb     387      Ferris Q. Le
2        11    jan     454  Mira N. Franklin
3        11    mar     495  Mira N. Franklin
4        11    sep     145  Mira N. Franklin
5        12    may    4959    Baker C. Hurst
6        12    dec     493    Baker C. Hurst
7        12    oct    4981    Baker C. Hurst
8        12    jan     484    Baker C. Hurst

Fusión muchos-a-muchos

[11]:
%%writefile /tmp/sales.csv
clientId,month
10,jan
10,feb
11,jan
11,mar
11,sep
12,may
12,dec
12,oct
12,jan
Overwriting /tmp/sales.csv
[12]:
%%writefile /tmp/lines.csv
clientId,line
10,A
10,B
10,C
11,D
12,A
12,D
13,B
13,C
13,D
Writing /tmp/lines.csv
[13]:
sales = pd.read_csv("/tmp/sales.csv")
lines = pd.read_csv("/tmp/lines.csv")

display(
    sales,
    "",
    lines,
    "",
    pd.merge(
        sales,
        lines,
    )
)
   clientId month
0        10   jan
1        10   feb
2        11   jan
3        11   mar
4        11   sep
5        12   may
6        12   dec
7        12   oct
8        12   jan
''
   clientId line
0        10    A
1        10    B
2        10    C
3        11    D
4        12    A
5        12    D
6        13    B
7        13    C
8        13    D
''
    clientId month line
0         10   jan    A
1         10   jan    B
2         10   jan    C
3         10   feb    A
4         10   feb    B
5         10   feb    C
6         11   jan    D
7         11   mar    D
8         11   sep    D
9         12   may    A
10        12   may    D
11        12   dec    A
12        12   dec    D
13        12   oct    A
14        12   oct    D
15        12   jan    A
16        12   jan    D

Parámetros left_on, right_on

El campo clave tiene nombres diferentes en cada tabla.

[14]:
%%writefile /tmp/clients.csv
clientId,name
13,Ferris Q. Le
10,Omar Y. Fletcher
12,Mira N. Franklin
11,Buffy W. Vincent
Overwriting /tmp/clients.csv
[15]:
%%writefile /tmp/bonus.csv
Client-Id,bonus
10,279
11,160
12,267
13,215
Overwriting /tmp/bonus.csv
[16]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")

#
# Note que las dos claves aparecen en la tabla
#
display(
    clients,
    "",
    bonus,
    "",
    pd.merge(
        clients,
        bonus,
        left_on="clientId",
        right_on="Client-Id",
    )
)
   clientId              name
0        13      Ferris Q. Le
1        10  Omar Y. Fletcher
2        12  Mira N. Franklin
3        11  Buffy W. Vincent
''
   Client-Id  bonus
0         10    279
1         11    160
2         12    267
3         13    215
''
   clientId              name  Client-Id  bonus
0        13      Ferris Q. Le         13    215
1        10  Omar Y. Fletcher         10    279
2        12  Mira N. Franklin         12    267
3        11  Buffy W. Vincent         11    160
[17]:
#
# Remoción de una de las columnas
#
display(
    clients,
    "",
    bonus,
    "",
    pd.merge(
        clients,
        bonus,
        left_on="clientId",
        right_on="Client-Id",
    ).drop(
        "Client-Id",
        axis=1,
    )
)
   clientId              name
0        13      Ferris Q. Le
1        10  Omar Y. Fletcher
2        12  Mira N. Franklin
3        11  Buffy W. Vincent
''
   Client-Id  bonus
0         10    279
1         11    160
2         12    267
3         13    215
''
   clientId              name  bonus
0        13      Ferris Q. Le    215
1        10  Omar Y. Fletcher    279
2        12  Mira N. Franklin    267
3        11  Buffy W. Vincent    160

Joins usando el parámetro how

joins.png

Ambas tablas tienen algunos campos en comun (y otros diferentes!).

[18]:
%%writefile /tmp/clients_a.csv
clientId,name,location,amount
10,Omar Y. Fletcher,6833 Mollis. Rd.,4929
11,Buffy W. Vincent,"P.O. Box 345, 8390 Ante Avenue",7366
12,Mira N. Franklin,"P.O. Box 445, 323 Cursus Rd.",6184
Writing /tmp/clients_a.csv
[19]:
%%writefile /tmp/clients_b.csv
clientId,name,location,bonus
11,Buffy W. Vincent,"P.O. Box 345, 8390 Ante Avenue",100
12,Mira N. Franklin,"P.O. Box 445, 323 Cursus Rd.",200
13,Lilah O. Morrison,3859 Mauris Ave,300
Writing /tmp/clients_b.csv
[20]:
clients_a = pd.read_csv("/tmp/clients_a.csv")
clients_b = pd.read_csv("/tmp/clients_b.csv")

#
# Campos diferentes en cada tabla
#
display(
    set(clients_a.columns) - set(clients_b.columns),
    set(clients_b.columns) - set(clients_a.columns),
)
{'amount'}
{'bonus'}
[21]:
#
# Inner join
#
display(
    clients_a,
    "",
    clients_b,
    "",

    pd.merge(
        clients_a,
        clients_b,
        how="inner",
    )
)
   clientId              name                        location  amount
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184
''
   clientId               name                        location  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    200
2        13  Lilah O. Morrison                 3859 Mauris Ave    300
''
   clientId              name                        location  amount  bonus
0        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366    100
1        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184    200
[22]:
#
# Left join
#
display(
    clients_a,
    "",
    clients_b,
    "",

    pd.merge(
        clients_a,
        clients_b,
        how="left",
    )
)
   clientId              name                        location  amount
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184
''
   clientId               name                        location  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    200
2        13  Lilah O. Morrison                 3859 Mauris Ave    300
''
   clientId              name                        location  amount  bonus
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929    NaN
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366  100.0
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184  200.0
[23]:
#
# Right join
#
display(
    clients_a,
    "",
    clients_b,
    "",

    pd.merge(
        clients_a,
        clients_b,
        how="right",
    )
)
   clientId              name                        location  amount
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184
''
   clientId               name                        location  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    200
2        13  Lilah O. Morrison                 3859 Mauris Ave    300
''
   clientId               name                        location  amount  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue  7366.0    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.  6184.0    200
2        13  Lilah O. Morrison                 3859 Mauris Ave     NaN    300
[24]:
#
# Outer join
#
display(
    clients_a,
    "",
    clients_b,
    "",

    pd.merge(
        clients_a,
        clients_b,
        how="outer",
    )
)
   clientId              name                        location  amount
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184
''
   clientId               name                        location  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    200
2        13  Lilah O. Morrison                 3859 Mauris Ave    300
''
   clientId               name                        location  amount  bonus
0        10   Omar Y. Fletcher                6833 Mollis. Rd.  4929.0    NaN
1        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue  7366.0  100.0
2        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.  6184.0  200.0
3        13  Lilah O. Morrison                 3859 Mauris Ave     NaN  300.0
[25]:
#
# Outer join
#
display(
    clients_a,
    "",
    clients_b,
    "",

    pd.merge(
        clients_a,
        clients_b,
        how="outer",
        indicator=True,
    )
)
   clientId              name                        location  amount
0        10  Omar Y. Fletcher                6833 Mollis. Rd.    4929
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    7366
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    6184
''
   clientId               name                        location  bonus
0        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue    100
1        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.    200
2        13  Lilah O. Morrison                 3859 Mauris Ave    300
''
   clientId               name                        location  amount  bonus  \
0        10   Omar Y. Fletcher                6833 Mollis. Rd.  4929.0    NaN
1        11   Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue  7366.0  100.0
2        12   Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.  6184.0  200.0
3        13  Lilah O. Morrison                 3859 Mauris Ave     NaN  300.0

       _merge
0   left_only
1        both
2        both
3  right_only

Sufijos

[26]:
%%writefile /tmp/data_1.csv
clientId,info
10,Omar Y. Fletcher
11,Buffy W. Vincent
12,Mira N. Franklin
Writing /tmp/data_1.csv
[27]:
%%writefile /tmp/data_2.csv
clientId,info
10,6833 Mollis. Rd.
11,"P.O. Box 345, 8390 Ante Avenue"
12,"P.O. Box 445, 323 Cursus Rd."
Writing /tmp/data_2.csv
[28]:
df_1 = pd.read_csv("/tmp/data_1.csv")
df_2 = pd.read_csv("/tmp/data_2.csv")


#
# Sufijos generados para desambiguar los nombres
# de las columnas
#
display(
    df_1,
    "",
    df_2,
    "",
    pd.merge(
        df_1,
        df_2,
        on="clientId",
    )
)
   clientId              info
0        10  Omar Y. Fletcher
1        11  Buffy W. Vincent
2        12  Mira N. Franklin
''
   clientId                            info
0        10                6833 Mollis. Rd.
1        11  P.O. Box 345, 8390 Ante Avenue
2        12    P.O. Box 445, 323 Cursus Rd.
''
   clientId            info_x                          info_y
0        10  Omar Y. Fletcher                6833 Mollis. Rd.
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.
[29]:
#
# Especificación del sufijo
#
display(
    df_1,
    "",
    df_2,
    "",
    pd.merge(
        df_1,
        df_2,
        on="clientId",
        suffixes=["_LD", "_RD"],
    )
)
   clientId              info
0        10  Omar Y. Fletcher
1        11  Buffy W. Vincent
2        12  Mira N. Franklin
''
   clientId                            info
0        10                6833 Mollis. Rd.
1        11  P.O. Box 345, 8390 Ante Avenue
2        12    P.O. Box 445, 323 Cursus Rd.
''
   clientId           info_LD                         info_RD
0        10  Omar Y. Fletcher                6833 Mollis. Rd.
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.
[30]:
#
# Renombramiento de las columnas
#
display(
    df_1,
    "",
    df_2,
    "",
    pd.merge(
        df_1,
        df_2,
        on="clientId",
    ).rename(
        columns={
            "info_x": "name",
            "info_y": "location",
        }
    )
)
   clientId              info
0        10  Omar Y. Fletcher
1        11  Buffy W. Vincent
2        12  Mira N. Franklin
''
   clientId                            info
0        10                6833 Mollis. Rd.
1        11  P.O. Box 345, 8390 Ante Avenue
2        12    P.O. Box 445, 323 Cursus Rd.
''
   clientId              name                        location
0        10  Omar Y. Fletcher                6833 Mollis. Rd.
1        11  Buffy W. Vincent  P.O. Box 345, 8390 Ante Avenue
2        12  Mira N. Franklin    P.O. Box 445, 323 Cursus Rd.