Reputation: 709
I have a very large pandas DataFrame (~300,000 columns and ~17,520 rows) called result_full
. I am attempting to replace all of the strings "NaN"
with numpy.nan
:
result_full.replace(["NaN"], np.nan, inplace=True)
This is the line where I get a MemoryError.
Is there a memory-efficient way to drop these strings from my dataframe? I tried result_full.dropna()
, but it didn't work because the cells are technically strings that say "NaN", not real missing values.
Upvotes: 1
Views: 2677
Reputation: 540
One of the issues could be that you are on a 32-bit machine: a 32-bit Python process can typically address only around 2 GB of memory. If possible, scale up to a 64-bit machine to avoid problems in the future.
In the meantime, there is a workaround. Write the DataFrame out with the df.to_csv()
method. Once that's done, if you look at the pandas documentation for read_csv()
, you will notice this parameter:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.
So it will recognize the string 'NaN' as np.nan, and your problem should be solved.
Likewise, if you are creating this DataFrame directly from a CSV in the first place, you can pass this parameter there and avoid the memory problem entirely. Hope it helps. Cheers!
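A minimal sketch of that round trip, assuming result_full is the DataFrame from the question and "result_full.csv" is just a placeholder file name:
import pandas as pd

# Write the DataFrame to disk once; to_csv streams the rows out instead of
# building a second full copy of the data in memory.
result_full.to_csv("result_full.csv", index=False)

# Read it back. 'NaN' is already in read_csv's default na_values list, so
# those string cells come back as real missing values; passing na_values
# explicitly just makes the intent obvious.
result_full = pd.read_csv("result_full.csv", na_values=["NaN"])

# Now dropna() behaves as expected, because the cells are genuine NaN.
result_full = result_full.dropna()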
Upvotes: 3