Fusión (merging/join) de datasets — 14:56 min
14:56 min | Última modificación: Octubre 6, 2021
Realiza la unión de datasets de forma similar a como se hacen los joins en las bases de datos.
[1]:
import pandas as pd
pd.set_option("display.notebook_repr_html", False)
Fusión uno-a-uno sobre columnas comunes
El join se ejecuta sobre las columnas que tienen los mismos nombres en ambas tablas.
Creación y carga del primer dataframe.
[2]:
%%writefile /tmp/clients.csv
clientId,key,name
11,B,Ferris Q. Le
10,A,Omar Y. Fletcher
11,A,Mira N. Franklin
10,B,Buffy W. Vincent
Writing /tmp/clients.csv
Creación y carga del segundo dataframe.
[3]:
%%writefile /tmp/bonus.csv
clientId,key,bonus
10,A,138
10,B,227
11,A,279
11,B,160
Overwriting /tmp/bonus.csv
Merge usando como claves las columnas comunes
[4]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")
display(
clients.sort_values(['clientId', 'key']),
"",
bonus.sort_values(['clientId', 'key']),
"",
pd.merge(
clients,
bonus,
sort=True,
)
)
clientId key name
1 10 A Omar Y. Fletcher
3 10 B Buffy W. Vincent
2 11 A Mira N. Franklin
0 11 B Ferris Q. Le
''
clientId key bonus
0 10 A 138
1 10 B 227
2 11 A 279
3 11 B 160
''
clientId key name bonus
0 10 A Omar Y. Fletcher 138
1 10 B Buffy W. Vincent 227
2 11 A Mira N. Franklin 279
3 11 B Ferris Q. Le 160
Fusión uno-a-uno con registros incompletos
[5]:
%%writefile /tmp/clients.csv
clientId,name
10,Ferris Q. Le
11,Mira N. Franklin
12,Baker C. Hurst
Overwriting /tmp/clients.csv
[6]:
%%writefile /tmp/bonus.csv
clientId,bonus
10,279
11,160
20,169
21,263
Overwriting /tmp/bonus.csv
[7]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")
#
# Note que merge solo retorno los registros con
# información completa
#
display(
clients,
"",
bonus,
"",
pd.merge(
clients,
bonus,
on="clientId",
)
)
clientId name
0 10 Ferris Q. Le
1 11 Mira N. Franklin
2 12 Baker C. Hurst
''
clientId bonus
0 10 279
1 11 160
2 20 169
3 21 263
''
clientId name bonus
0 10 Ferris Q. Le 279
1 11 Mira N. Franklin 160
Fusión muchos-a-uno
[8]:
%%writefile /tmp/clients.csv
clientId,name
10,Ferris Q. Le
11,Mira N. Franklin
12,Baker C. Hurst
Overwriting /tmp/clients.csv
[9]:
%%writefile /tmp/sales.csv
clientId, month, sales
10, jan, 1239
10, feb, 387
11, jan, 454
11, mar, 495
11, sep, 145
12, may, 4959
12, dec, 493
12, oct, 4981
12, jan, 484
15, may, 394
15, sep, 585
Writing /tmp/sales.csv
[10]:
clients = pd.read_csv("/tmp/clients.csv")
sales = pd.read_csv("/tmp/sales.csv")
#
# Sales tiene valores del campo clientId
# duplicados. Note que se copia la información
# de la tabla de la derecha
#
display(
clients,
"",
sales,
"",
pd.merge(
sales,
clients,
sort=True,
)
)
clientId name
0 10 Ferris Q. Le
1 11 Mira N. Franklin
2 12 Baker C. Hurst
''
clientId month sales
0 10 jan 1239
1 10 feb 387
2 11 jan 454
3 11 mar 495
4 11 sep 145
5 12 may 4959
6 12 dec 493
7 12 oct 4981
8 12 jan 484
9 15 may 394
10 15 sep 585
''
clientId month sales name
0 10 jan 1239 Ferris Q. Le
1 10 feb 387 Ferris Q. Le
2 11 jan 454 Mira N. Franklin
3 11 mar 495 Mira N. Franklin
4 11 sep 145 Mira N. Franklin
5 12 may 4959 Baker C. Hurst
6 12 dec 493 Baker C. Hurst
7 12 oct 4981 Baker C. Hurst
8 12 jan 484 Baker C. Hurst
Fusión muchos-a-muchos
[11]:
%%writefile /tmp/sales.csv
clientId,month
10,jan
10,feb
11,jan
11,mar
11,sep
12,may
12,dec
12,oct
12,jan
Overwriting /tmp/sales.csv
[12]:
%%writefile /tmp/lines.csv
clientId,line
10,A
10,B
10,C
11,D
12,A
12,D
13,B
13,C
13,D
Writing /tmp/lines.csv
[13]:
sales = pd.read_csv("/tmp/sales.csv")
lines = pd.read_csv("/tmp/lines.csv")
display(
sales,
"",
lines,
"",
pd.merge(
sales,
lines,
)
)
clientId month
0 10 jan
1 10 feb
2 11 jan
3 11 mar
4 11 sep
5 12 may
6 12 dec
7 12 oct
8 12 jan
''
clientId line
0 10 A
1 10 B
2 10 C
3 11 D
4 12 A
5 12 D
6 13 B
7 13 C
8 13 D
''
clientId month line
0 10 jan A
1 10 jan B
2 10 jan C
3 10 feb A
4 10 feb B
5 10 feb C
6 11 jan D
7 11 mar D
8 11 sep D
9 12 may A
10 12 may D
11 12 dec A
12 12 dec D
13 12 oct A
14 12 oct D
15 12 jan A
16 12 jan D
Parámetros left_on, right_on
El campo clave tiene nombres diferentes en cada tabla.
[14]:
%%writefile /tmp/clients.csv
clientId,name
13,Ferris Q. Le
10,Omar Y. Fletcher
12,Mira N. Franklin
11,Buffy W. Vincent
Overwriting /tmp/clients.csv
[15]:
%%writefile /tmp/bonus.csv
Client-Id,bonus
10,279
11,160
12,267
13,215
Overwriting /tmp/bonus.csv
[16]:
clients = pd.read_csv("/tmp/clients.csv")
bonus = pd.read_csv("/tmp/bonus.csv")
#
# Note que las dos claves aparecen en la tabla
#
display(
clients,
"",
bonus,
"",
pd.merge(
clients,
bonus,
left_on="clientId",
right_on="Client-Id",
)
)
clientId name
0 13 Ferris Q. Le
1 10 Omar Y. Fletcher
2 12 Mira N. Franklin
3 11 Buffy W. Vincent
''
Client-Id bonus
0 10 279
1 11 160
2 12 267
3 13 215
''
clientId name Client-Id bonus
0 13 Ferris Q. Le 13 215
1 10 Omar Y. Fletcher 10 279
2 12 Mira N. Franklin 12 267
3 11 Buffy W. Vincent 11 160
[17]:
#
# Remoción de una de las columnas
#
display(
clients,
"",
bonus,
"",
pd.merge(
clients,
bonus,
left_on="clientId",
right_on="Client-Id",
).drop(
"Client-Id",
axis=1,
)
)
clientId name
0 13 Ferris Q. Le
1 10 Omar Y. Fletcher
2 12 Mira N. Franklin
3 11 Buffy W. Vincent
''
Client-Id bonus
0 10 279
1 11 160
2 12 267
3 13 215
''
clientId name bonus
0 13 Ferris Q. Le 215
1 10 Omar Y. Fletcher 279
2 12 Mira N. Franklin 267
3 11 Buffy W. Vincent 160
Joins usando el parámetro how
Ambas tablas tienen algunos campos en comun (y otros diferentes!).
[18]:
%%writefile /tmp/clients_a.csv
clientId,name,location,amount
10,Omar Y. Fletcher,6833 Mollis. Rd.,4929
11,Buffy W. Vincent,"P.O. Box 345, 8390 Ante Avenue",7366
12,Mira N. Franklin,"P.O. Box 445, 323 Cursus Rd.",6184
Writing /tmp/clients_a.csv
[19]:
%%writefile /tmp/clients_b.csv
clientId,name,location,bonus
11,Buffy W. Vincent,"P.O. Box 345, 8390 Ante Avenue",100
12,Mira N. Franklin,"P.O. Box 445, 323 Cursus Rd.",200
13,Lilah O. Morrison,3859 Mauris Ave,300
Writing /tmp/clients_b.csv
[20]:
clients_a = pd.read_csv("/tmp/clients_a.csv")
clients_b = pd.read_csv("/tmp/clients_b.csv")
#
# Campos diferentes en cada tabla
#
display(
set(clients_a.columns) - set(clients_b.columns),
set(clients_b.columns) - set(clients_a.columns),
)
{'amount'}
{'bonus'}
[21]:
#
# Inner join
#
display(
clients_a,
"",
clients_b,
"",
pd.merge(
clients_a,
clients_b,
how="inner",
)
)
clientId name location amount
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184
''
clientId name location bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 200
2 13 Lilah O. Morrison 3859 Mauris Ave 300
''
clientId name location amount bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184 200
[22]:
#
# Left join
#
display(
clients_a,
"",
clients_b,
"",
pd.merge(
clients_a,
clients_b,
how="left",
)
)
clientId name location amount
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184
''
clientId name location bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 200
2 13 Lilah O. Morrison 3859 Mauris Ave 300
''
clientId name location amount bonus
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929 NaN
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366 100.0
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184 200.0
[23]:
#
# Right join
#
display(
clients_a,
"",
clients_b,
"",
pd.merge(
clients_a,
clients_b,
how="right",
)
)
clientId name location amount
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184
''
clientId name location bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 200
2 13 Lilah O. Morrison 3859 Mauris Ave 300
''
clientId name location amount bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366.0 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184.0 200
2 13 Lilah O. Morrison 3859 Mauris Ave NaN 300
[24]:
#
# Outer join
#
display(
clients_a,
"",
clients_b,
"",
pd.merge(
clients_a,
clients_b,
how="outer",
)
)
clientId name location amount
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184
''
clientId name location bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 200
2 13 Lilah O. Morrison 3859 Mauris Ave 300
''
clientId name location amount bonus
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929.0 NaN
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366.0 100.0
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184.0 200.0
3 13 Lilah O. Morrison 3859 Mauris Ave NaN 300.0
[25]:
#
# Outer join
#
display(
clients_a,
"",
clients_b,
"",
pd.merge(
clients_a,
clients_b,
how="outer",
indicator=True,
)
)
clientId name location amount
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184
''
clientId name location bonus
0 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 100
1 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 200
2 13 Lilah O. Morrison 3859 Mauris Ave 300
''
clientId name location amount bonus \
0 10 Omar Y. Fletcher 6833 Mollis. Rd. 4929.0 NaN
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue 7366.0 100.0
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd. 6184.0 200.0
3 13 Lilah O. Morrison 3859 Mauris Ave NaN 300.0
_merge
0 left_only
1 both
2 both
3 right_only
Sufijos
[26]:
%%writefile /tmp/data_1.csv
clientId,info
10,Omar Y. Fletcher
11,Buffy W. Vincent
12,Mira N. Franklin
Writing /tmp/data_1.csv
[27]:
%%writefile /tmp/data_2.csv
clientId,info
10,6833 Mollis. Rd.
11,"P.O. Box 345, 8390 Ante Avenue"
12,"P.O. Box 445, 323 Cursus Rd."
Writing /tmp/data_2.csv
[28]:
df_1 = pd.read_csv("/tmp/data_1.csv")
df_2 = pd.read_csv("/tmp/data_2.csv")
#
# Sufijos generados para desambiguar los nombres
# de las columnas
#
display(
df_1,
"",
df_2,
"",
pd.merge(
df_1,
df_2,
on="clientId",
)
)
clientId info
0 10 Omar Y. Fletcher
1 11 Buffy W. Vincent
2 12 Mira N. Franklin
''
clientId info
0 10 6833 Mollis. Rd.
1 11 P.O. Box 345, 8390 Ante Avenue
2 12 P.O. Box 445, 323 Cursus Rd.
''
clientId info_x info_y
0 10 Omar Y. Fletcher 6833 Mollis. Rd.
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd.
[29]:
#
# Especificación del sufijo
#
display(
df_1,
"",
df_2,
"",
pd.merge(
df_1,
df_2,
on="clientId",
suffixes=["_LD", "_RD"],
)
)
clientId info
0 10 Omar Y. Fletcher
1 11 Buffy W. Vincent
2 12 Mira N. Franklin
''
clientId info
0 10 6833 Mollis. Rd.
1 11 P.O. Box 345, 8390 Ante Avenue
2 12 P.O. Box 445, 323 Cursus Rd.
''
clientId info_LD info_RD
0 10 Omar Y. Fletcher 6833 Mollis. Rd.
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd.
[30]:
#
# Renombramiento de las columnas
#
display(
df_1,
"",
df_2,
"",
pd.merge(
df_1,
df_2,
on="clientId",
).rename(
columns={
"info_x": "name",
"info_y": "location",
}
)
)
clientId info
0 10 Omar Y. Fletcher
1 11 Buffy W. Vincent
2 12 Mira N. Franklin
''
clientId info
0 10 6833 Mollis. Rd.
1 11 P.O. Box 345, 8390 Ante Avenue
2 12 P.O. Box 445, 323 Cursus Rd.
''
clientId name location
0 10 Omar Y. Fletcher 6833 Mollis. Rd.
1 11 Buffy W. Vincent P.O. Box 345, 8390 Ante Avenue
2 12 Mira N. Franklin P.O. Box 445, 323 Cursus Rd.