Data type verification

4:30 min | Last modified: October 14, 2021

This step checks and corrects the data type of each column in a table.

[1]:

import pandas as pd

String to integer or float

[2]:

%%writefile /tmp/data.csv
orderId,price,percentage
1,100$,15.3%
2,120$,22.1%
3,128$,54.2%
4,155$,10.0%
5,234$,6%

Overwriting /tmp/data.csv

[3]:

df = pd.read_csv("/tmp/data.csv")

#
# Note that the price and percentage columns are read
# as text (dtype object) rather than as numbers. This is
# caused by the $ and % characters in the file.
#
display(
    df,
    df.dtypes
)

	orderId	price	percentage
0	1	100$	15.3%
1	2	120$	22.1%
2	3	128$	54.2%
3	4	155$	10.0%
4	5	234$	6%

orderId        int64
price         object
percentage    object
dtype: object
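
The same check can be done programmatically instead of by reading the `dtypes` listing. The following is a minimal sketch, assuming the file content is loaded from an in-memory string rather than from /tmp/data.csv; the names `csv_text`, `df_check` and `text_columns` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Same content as /tmp/data.csv, kept in memory so the sketch is self-contained.
csv_text = """orderId,price,percentage
1,100$,15.3%
2,120$,22.1%
3,128$,54.2%
4,155$,10.0%
5,234$,6%
"""

df_check = pd.read_csv(io.StringIO(csv_text))

# Columns that pandas could not parse as numbers are stored as text.
text_columns = [
    name
    for name in df_check.columns
    if not pd.api.types.is_numeric_dtype(df_check[name])
]
print(text_columns)
```

Any column that shows up in this list but is expected to be numeric is a candidate for the correction in the next cell.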

[4]:

#
# Fix: strip the symbol, then cast to a numeric type
#
df.price = df.price.str.strip('$')
df.price = df.price.astype(int)

df.percentage = df.percentage.str.strip('%')
df.percentage = df.percentage.astype(float)

display(
    df,
    df.dtypes
)

	orderId	price	percentage
0	1	100	15.3
1	2	120	22.1
2	3	128	54.2
3	4	155	10.0
4	5	234	6.0

orderId         int64
price           int64
percentage    float64
dtype: object
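
`astype` raises an error if any value still cannot be parsed after the symbol is removed. When that is a possibility, `pd.to_numeric` with `errors="coerce"` may be a more forgiving option, since it turns unparseable entries into NaN. A minimal sketch, assuming a hypothetical malformed value `"n/a"` that does not appear in the original data:

```python
import pandas as pd

# Hypothetical column: the last value is malformed on purpose.
raw_price = pd.Series(["100$", "120$", "n/a"])

# Remove the trailing "$", then convert; unparseable entries become NaN.
clean_price = pd.to_numeric(raw_price.str.rstrip("$"), errors="coerce")

print(clean_price)
print(clean_price.isna().sum())
```

The count of missing values after the conversion indicates how many rows need manual review.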

Numeric to category

Codebook:

single
married
divorced

[5]:

%%writefile /tmp/data.csv
personId,status
1,0
2,0
3,1
3,2
4,2

Overwriting /tmp/data.csv

[6]:

df = pd.read_csv("/tmp/data.csv")

#
# status is read as int64
#
display(
    df,
    df.dtypes
)

	personId	status
0	1	0
1	2	0
2	3	1
3	3	2
4	4	2

personId    int64
status      int64
dtype: object

[7]:

df.status = df.status.astype('category')

display(
    df,
    df.dtypes
)

	personId	status
0	1	0
1	2	0
2	3	1
3	3	2
4	4	2

personId       int64
status      category
dtype: object

[8]:

df.describe()

[8]:

	personId
count	5.000000
mean	2.600000
std	1.140175
min	1.000000
25%	2.000000
50%	3.000000
75%	3.000000
max	4.000000
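
The summary above reports only `personId` because, by default, `describe()` on a mixed table summarizes the numeric columns, and `status` is now categorical. A sketch of how the categorical column could be summarized on its own, rebuilding the same table in memory (the names `people`, `status_summary` and `status_counts` are illustrative):

```python
import pandas as pd

people = pd.DataFrame(
    {"personId": [1, 2, 3, 3, 4], "status": [0, 0, 1, 2, 2]}
)
people["status"] = people["status"].astype("category")

# For a categorical column, describe() reports count, unique, top and freq.
status_summary = people["status"].describe()
print(status_summary)

# Frequency of each code, ordered by code rather than by count.
status_counts = people["status"].value_counts().sort_index()
print(status_counts)
```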

[9]:

df.status

[9]:

0    0
1    0
2    1
3    2
4    2
Name: status, dtype: category
Categories (3, int64): [0, 1, 2]
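
The integer codes can also be replaced by the codebook labels. The sketch below assumes that codes 0, 1 and 2 correspond to the labels in the order the codebook lists them (single, married, divorced); the notebook does not state this mapping explicitly, so treat it as an assumption.

```python
import pandas as pd

# Assumed mapping: codes follow the order in which the codebook lists the labels.
codebook = {0: "single", 1: "married", 2: "divorced"}

status = pd.Series([0, 0, 1, 2, 2], name="status").astype("category")

# rename_categories keeps the categorical dtype and swaps codes for labels.
labeled_status = status.cat.rename_categories(codebook)
print(labeled_status)
```

Keeping the column categorical after renaming preserves the fixed set of allowed values, which a plain string column would not.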