Reputation: 111
Working in Python, I'm using dask for a ~20 GB data set. One of the columns contains integers, but for some reason dask reads this column in with dtype "object". How would I convert this to numeric, float64, or integer? I've tried using dd.to_numeric, but get the following error: "module 'dask.dataframe' has no attribute 'to_numeric'".
EDIT: I think this is complicated by the fact that the data has commas as thousands separators (e.g. 2,133 instead of 2133). I'm not quite sure how to deal with this. I tried starting with pandas and using .astype(int), but that obviously didn't work.
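Roughly what I tried with pandas (the column name "count" here is just a placeholder for my real column):
import pandas as pd

# toy version of the problem column, with thousands separators
df = pd.DataFrame({"count": ["2,133", "45", "1,000,000"]})

# this raises:
# ValueError: invalid literal for int() with base 10: '2,133'
df["count"] = df["count"].astype(int)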
Upvotes: 1
Views: 1146
Reputation: 13437
You should use the thousands parameter, the same one pandas' read_csv takes; dd.read_csv passes it through:
import pandas as pd
import dask.dataframe as dd

# write a small example file with thousands separators
pd.DataFrame({"a": ["1,000", "1", "1,000,000"]}).to_csv("out.csv", index=False)

# without thousands=",", the column is read as object
df = pd.read_csv("out.csv")
df = dd.read_csv("out.csv")

# with thousands=",", the column is read as numeric
df = pd.read_csv("out.csv", thousands=",")
df = dd.read_csv("out.csv", thousands=",")
Upvotes: 2