user3424575

Reputation: 111

Dask dataframe: how to convert a column dtype from object to numeric

Working in Python, I'm using dask for a ~20 GB data set. One of the columns contains integers, but for some reason dask reads this column in with dtype "object". How can I convert it to a numeric type (float64 or integer)? I tried dd.to_numeric, but got the following error: "module 'dask.dataframe' has no attribute 'to_numeric'".

EDIT: I think this is complicated by the fact that the data uses commas as thousands separators (e.g. 2,133 instead of 2133). I'm not quite sure how to deal with this. I tried starting with pandas and calling .astype(int), but that didn't work because of the commas.

Upvotes: 1

Views: 1146

Answers (1)

rpanai

Reputation: 13437

Pass the thousands parameter to read_csv. dask's read_csv accepts it just as pandas' does, so the column is parsed as numeric on load:

import pandas as pd
import dask.dataframe as dd
# write a small sample file (to_csv returns None when given a path, so don't assign it)
pd.DataFrame({"a": ['1,000', '1', '1,000,000']})\
  .to_csv("out.csv", index=False)

# without thousands, column "a" is read as object (strings)
df = pd.read_csv("out.csv")
df = dd.read_csv("out.csv")

# with thousands=",", column "a" is read as int64
df = pd.read_csv("out.csv", thousands=",")
df = dd.read_csv("out.csv", thousands=",")
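
If the data is already loaded and the column is still object, one alternative is to strip the commas and cast afterwards. The following is a minimal sketch, not part of the original answer: strip_thousands is a hypothetical helper name, and it is demonstrated on a pandas Series. It is assumed to work the same way on a dask Series, since dask mirrors the .str accessor and astype, but that is not shown here.

```python
import pandas as pd


def strip_thousands(series):
    """Remove comma thousands separators and cast to int64.

    Hypothetical helper for illustration; assumes every value is a
    string of digits and commas with no missing values.
    """
    # regex=False treats "," as a literal string rather than a pattern
    return series.str.replace(",", "", regex=False).astype("int64")


s = pd.Series(["1,000", "1", "1,000,000"])
cleaned = strip_thousands(s)
print(cleaned.tolist())  # [1000, 1, 1000000]
```

Parsing with thousands="," at read time is usually simpler. The cast above will raise if the column contains missing or non-numeric values, so those would need handling first.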

Upvotes: 2
