Kevin

Reputation: 65

pandas dataframe creation from dict or list too slow, any suggestions?

Hello, I am trying to create a pandas DataFrame from a list of dicts (or a dict of dicts) with an eventual shape of 60,000 rows and ~10,000 columns.

The column values are all 0 or 1, and the data is really sparse.

Creating the list/dict object is fast, but when I call from_dict or from_records I get memory errors. I also tried appending to a DataFrame periodically rather than all at once, and it still didn't work. I also tried setting individual cells one at a time, to no avail.

By the way, I am building my Python objects from about 100 JSON files that I parse.

How can I go from Python objects to DataFrames? Maybe I can also use something else entirely; I eventually want to feed the data to a scikit-learn algorithm.
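
Roughly, the construction looks like this (the file pattern and record layout are simplified stand-ins, not my real data):

import glob
import json

import pandas as pd

# Parse ~100 JSON files into one list of {feature_name: 0/1} dicts
# ("data/*.json" and the record layout are hypothetical examples).
records = []
for path in glob.glob("data/*.json"):
    with open(path) as f:
        records.extend(json.load(f))

# ~60,000 records x ~10,000 distinct keys; this is the step
# that runs out of memory:
df = pd.DataFrame.from_records(records)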

Upvotes: 2

Views: 2840

Answers (2)

K.S.

Reputation: 3189

Another thing you might try, if you are going through PySpark, is enabling pyarrow:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

This sped up my calls to pd.DataFrame by an order of magnitude!

(Note that to use pyarrow, you must use pyspark>=3.0.0 with a newer pyarrow (e.g. pyarrow>=1.0.0). For pyspark==2.x, it's easiest to use pyarrow==0.15.x.)
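
For context, a minimal sketch of the Arrow-accelerated path this refers to (assuming a local PySpark session; toPandas() is the call that Arrow speeds up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# With Arrow enabled, toPandas() transfers columns in bulk
# instead of serializing row by row.
sdf = spark.range(1_000_000)
pdf = sdf.toPandas()
print(pdf.shape)  # (1000000, 1)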

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210822

If you have only 0 and 1 as values, you should use np.bool or np.int8 as the dtype; this will reduce your memory consumption by at least a factor of 4.

Here is a small demonstration:

In [34]: df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000)))

In [35]: df.shape
Out[35]: (60000, 10000)

In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 2.2 GB

By default, pandas uses np.int32 (32 bits or 4 bytes) for integers here (the default integer width is platform dependent; on Linux it would even be np.int64).

Let's downcast it to np.int8:

In [39]: df_int8 = df.astype(np.int8)

In [40]: df_int8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int8(10000)
memory usage: 572.2 MB

It now consumes 572 MB instead of 2.2 GB (four times less).

Or using np.bool:

In [41]: df_bool = df.astype(np.bool)

In [42]: df_bool.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: bool(10000)
memory usage: 572.2 MB
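
A follow-up sketch: casting with astype() still allocates the full int32 frame first, so when memory is tight it can help to pass the dtype at construction time so the 2.2 GB intermediate never exists:

import numpy as np
import pandas as pd

# Generate the 0/1 data as int8 from the start; no int32 temporary.
df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000), dtype=np.int8))
df.info()  # memory usage: 572.2 MB

And since the data is described as really sparse, a scipy.sparse matrix would reduce memory even further; most scikit-learn estimators accept sparse matrices directly.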

Upvotes: 1
