Zhe WU
Zhe WU

Reputation: 41

Pandas read_csv declaration dtype convets column incorrectly

When I use pandas to read a csv file, I set the dtype by using a dictionary (dict_types) in order to save memory:

dict_types = {
    "Id": "object",
    "COD_OPE": "object",
    "NUM_OPE": "float32",
    "STR_ANA": "category",
    "COD_EXO": "category",
    "NUM_CPT": "object",
    "MTT_DEV": "float32",
    "SEN_MTT": "category",
}
columns = [
    "Id",
    "COD_OPE",
    "NUM_OPE",
    "STR_ANA",
    "COD_EXO",
    "NUM_CPT",
    "MTT_DEV",
    "SEN_MTT",
    "DTE_OPE", 
    "DTE_CPT",
    "DTE_VAL", 
    "DTE_TRT"
]
df_chunk = pd.read_csv(
    "../SIDP_Operations/SAB_OPE_02_2020/SAB_OPE_02_2020.rpt",
    sep="\t",
    dtype=dict_types,
    usecols=columns,
    error_bad_lines=False,
    chunksize=1000000,
    parse_dates=["DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"],
    infer_datetime_format=True,
)

But, when the file is loaded and I look at df.info() and df.dtypes, it tells me that the type of STR_ANA is object while I expect it should be category. And "COD_EXO" "SEN_MTT" is indeed type category

Why?

maybe it because the way that i use chunk ? actually i did this to read the dataframe

chunk_list=[] 
for chunk in df_chunk: 
    chunk_list.append(chunk) 
df=pd.concat(chunk_list,ignore_index=True)

Upvotes: 4

Views: 429

Answers (1)

Hsuning
Hsuning

Reputation: 86

It's ok to use pd.concat() but it's better to add ignore_index=True to avoid duplicity of indexes.

df = pd.concat(df_chunk, ignore_index=True)

You have filled in 4 columns for the parameter parse_dates which do not exist in columns.

Make sure that the column names in your csv file are the same as the names for the dtypes and usecols you are assigning. To be sure, use the parameter header=N to ensure that pandas use the N line of the csv as your header.

Try this :

# add 4 columns with date
columns = [
    "Id",
    "COD_OPE",
    "NUM_OPE",
    "STR_ANA",
    "COD_EXO",
    "NUM_CPT",
    "MTT_DEV",
    "SEN_MTT",
    "DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"
]
df_chunk = pd.read_csv(
    "../SIDP_Operations/SAB_OPE_02_2020/SAB_OPE_02_2020.rpt",
    sep="\t",
    header=0,
    usecols=columns,
    dtype=dict_types,
    parse_dates=["DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"],
    infer_datetime_format=True,
    chunksize=1000000,
    error_bad_lines=False,
    low_memory=False
)

If it's still not working, try to remove low_memory=False.

Upvotes: 1

Related Questions