Reputation: 41
When I use pandas to read a csv file, I set the dtype
by using a dictionary (dict_types
) in order to save memory:
dict_types = {
"Id": "object",
"COD_OPE": "object",
"NUM_OPE": "float32",
"STR_ANA": "category",
"COD_EXO": "category",
"NUM_CPT": "object",
"MTT_DEV": "float32",
"SEN_MTT": "category",
}
columns = [
"Id",
"COD_OPE",
"NUM_OPE",
"STR_ANA",
"COD_EXO",
"NUM_CPT",
"MTT_DEV",
"SEN_MTT",
"DTE_OPE",
"DTE_CPT",
"DTE_VAL",
"DTE_TRT"
]
df_chunk = pd.read_csv(
"../SIDP_Operations/SAB_OPE_02_2020/SAB_OPE_02_2020.rpt",
sep="\t",
dtype=dict_types,
usecols=columns,
error_bad_lines=False,
chunksize=1000000,
parse_dates=["DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"],
infer_datetime_format=True,
)
But, when the file is loaded and I look at df.info()
and df.dtypes
, it tells me that the type of STR_ANA
is object
while I expect it should be category
.
And "COD_EXO" "SEN_MTT" is indeed type category
Why?
maybe it because the way that i use chunk ? actually i did this to read the dataframe
chunk_list=[]
for chunk in df_chunk:
chunk_list.append(chunk)
df=pd.concat(chunk_list,ignore_index=True)
Upvotes: 4
Views: 429
Reputation: 86
It's ok to use pd.concat() but it's better to add ignore_index=True
to avoid duplicity of indexes.
df = pd.concat(df_chunk, ignore_index=True)
You have filled in 4 columns for the parameter parse_dates
which do not exist in columns
.
Make sure that the column names in your csv file are the same as the names for the dtypes
and usecols
you are assigning. To be sure, use the parameter header=N
to ensure that pandas use the N line of the csv as your header.
Try this :
# add 4 columns with date
columns = [
"Id",
"COD_OPE",
"NUM_OPE",
"STR_ANA",
"COD_EXO",
"NUM_CPT",
"MTT_DEV",
"SEN_MTT",
"DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"
]
df_chunk = pd.read_csv(
"../SIDP_Operations/SAB_OPE_02_2020/SAB_OPE_02_2020.rpt",
sep="\t",
header=0,
usecols=columns,
dtype=dict_types,
parse_dates=["DTE_OPE", "DTE_CPT", "DTE_VAL", "DTE_TRT"],
infer_datetime_format=True,
chunksize=1000000,
error_bad_lines=False,
low_memory=False
)
If it's still not working, try to remove low_memory=False
.
Upvotes: 1