Registros duplicados — 6:15 min
6:15 min | Última modificación: Octubre 14, 2021 | YouTube
[1]:
import pandas as pd
Posibilidades:
Registro completo duplicado.
Algunos campos duplicados, demas campos con valores diferentes
Soluciones posibles:
* Borrado de los registros completamente duplicados.
* Agregación para campos con valores diferentes.
[2]:
%%writefile /tmp/data.csv
clientId,name,phone,companyId
1,Hersch Szymanowicz,+237 (561) 702-4118,3
2,Gaven Brito,+51 (870) 799-1508,7
3,Maressa Pavlishchev,+62 (350) 377-8621,4
4,Corine Dunseith,+63 (797) 344-2571,8
5,Ramon Lawrence,+66 (557) 865-3845,6
6,Ibbie Whitehouse,+98 (495) 896-6408,1
7,Neils Capelen,+86 (361) 914-8734,6
8,Thia Malkie,+46 (564) 145-8997,4
9,Missy Folomkin,+33 (962) 798-0776,7
10,Eleanor Gallamore,+86 (366) 702-2334,5
7,Neils Capelen,+86 (361) 914-8734,6
4,Corine Dunseith,+63 (797) 344-2571,8
2,Gaven Brito,+51 (870) 799-2308,2
5,Ramon Lawrence,+66 (557) 061-3844,5
4,Corine Dunseith,+63 (797) 344-2571,8
Overwriting /tmp/data.csv
[3]:
df = pd.read_csv('/tmp/data.csv')
display(
df,
df.dtypes
)
clientId | name | phone | companyId | |
---|---|---|---|---|
0 | 1 | Hersch Szymanowicz | +237 (561) 702-4118 | 3 |
1 | 2 | Gaven Brito | +51 (870) 799-1508 | 7 |
2 | 3 | Maressa Pavlishchev | +62 (350) 377-8621 | 4 |
3 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
4 | 5 | Ramon Lawrence | +66 (557) 865-3845 | 6 |
5 | 6 | Ibbie Whitehouse | +98 (495) 896-6408 | 1 |
6 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
7 | 8 | Thia Malkie | +46 (564) 145-8997 | 4 |
8 | 9 | Missy Folomkin | +33 (962) 798-0776 | 7 |
9 | 10 | Eleanor Gallamore | +86 (366) 702-2334 | 5 |
10 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
12 | 2 | Gaven Brito | +51 (870) 799-2308 | 2 |
13 | 5 | Ramon Lawrence | +66 (557) 061-3844 | 5 |
14 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
clientId int64
name object
phone object
companyId int64
dtype: object
[4]:
#
# La función duplicaated() indica si el registro
# completo está duplicado.
#
df.duplicated()
[4]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 True
11 True
12 False
13 False
14 True
dtype: bool
[5]:
#
# Visualización de los registros duplicados
#
df[df.duplicated(keep='first')]
[5]:
clientId | name | phone | companyId | |
---|---|---|---|---|
10 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
14 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
[6]:
#
# Visualización de los registros duplicados
#
df[df.duplicated(keep='last')]
[6]:
clientId | name | phone | companyId | |
---|---|---|---|---|
3 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
6 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
[7]:
#
# Visualización de los registros duplicados
#
df[
df.duplicated(
subset=["clientId", "name"],
keep="last",
)
]
[7]:
clientId | name | phone | companyId | |
---|---|---|---|---|
1 | 2 | Gaven Brito | +51 (870) 799-1508 | 7 |
3 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
4 | 5 | Ramon Lawrence | +66 (557) 865-3845 | 6 |
6 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
[8]:
#
# Visualización de los registros duplicados
#
df[
df.duplicated(
subset=["clientId", "name"],
keep=False,
)
].sort_values(by=['clientId', 'name'])
[8]:
clientId | name | phone | companyId | |
---|---|---|---|---|
1 | 2 | Gaven Brito | +51 (870) 799-1508 | 7 |
12 | 2 | Gaven Brito | +51 (870) 799-2308 | 2 |
3 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
14 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
4 | 5 | Ramon Lawrence | +66 (557) 865-3845 | 6 |
13 | 5 | Ramon Lawrence | +66 (557) 061-3844 | 5 |
6 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
10 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
[9]:
#
# Borrado de registros duplicados
#
df[df.duplicated()]
[9]:
clientId | name | phone | companyId | |
---|---|---|---|---|
10 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
11 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
14 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
[10]:
df.drop_duplicates(inplace=True)
df
[10]:
clientId | name | phone | companyId | |
---|---|---|---|---|
0 | 1 | Hersch Szymanowicz | +237 (561) 702-4118 | 3 |
1 | 2 | Gaven Brito | +51 (870) 799-1508 | 7 |
2 | 3 | Maressa Pavlishchev | +62 (350) 377-8621 | 4 |
3 | 4 | Corine Dunseith | +63 (797) 344-2571 | 8 |
4 | 5 | Ramon Lawrence | +66 (557) 865-3845 | 6 |
5 | 6 | Ibbie Whitehouse | +98 (495) 896-6408 | 1 |
6 | 7 | Neils Capelen | +86 (361) 914-8734 | 6 |
7 | 8 | Thia Malkie | +46 (564) 145-8997 | 4 |
8 | 9 | Missy Folomkin | +33 (962) 798-0776 | 7 |
9 | 10 | Eleanor Gallamore | +86 (366) 702-2334 | 5 |
12 | 2 | Gaven Brito | +51 (870) 799-2308 | 2 |
13 | 5 | Ramon Lawrence | +66 (557) 061-3844 | 5 |