Kevin

Reputation: 65

pandas dataframe creation from dict or list too slow, any suggestions?

Hello, I am trying to create a pandas DataFrame from a list of dicts (or a dict of dicts) with an eventual shape of 60,000 rows and ~10,000 columns.

The column values are all 0 or 1, and the data is really sparse.

Creating the list/dict object is fast, but when I call from_dict or from_records I get memory errors. I also tried appending to a DataFrame periodically rather than all at once, and it still didn't work. I also tried setting individual cells one at a time, to no avail.

By the way, I am building my Python objects from about 100 JSON files that I parse.

How can I go from Python objects to DataFrames? Maybe I can also use something else entirely; I eventually want to feed the data to a scikit-learn algorithm.
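
Roughly, the construction looks like this (the file pattern and record layout are simplified stand-ins, not my real data):

import glob
import json

import pandas as pd

# Parse ~100 JSON files into one list of {feature_name: 0/1} dicts
# ("data/*.json" and the record layout are hypothetical examples).
records = []
for path in glob.glob("data/*.json"):
    with open(path) as f:
        records.extend(json.load(f))

# ~60,000 records x ~10,000 distinct keys; this is the step
# that runs out of memory:
df = pd.DataFrame.from_records(records)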

Upvotes: 2

Views: 2840

Answers (2)

K.S.

Reputation: 3189

Another thing you might try, if you are going through PySpark, is enabling pyarrow:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

This sped up my calls to pd.DataFrame by an order of magnitude!

(Note that to use pyarrow, you must use pyspark>=3.0.0 with a newer pyarrow (e.g. pyarrow>=1.0.0). For pyspark==2.x, it's easiest to use pyarrow==0.15.x.)
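
For context, a minimal sketch of the Arrow-accelerated path this refers to (assuming a local PySpark session; toPandas() is the call that Arrow speeds up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# With Arrow enabled, toPandas() transfers columns in bulk
# instead of serializing row by row.
sdf = spark.range(1_000_000)
pdf = sdf.toPandas()
print(pdf.shape)  # (1000000, 1)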

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210822

If you have only 0 and 1 as values, you should use np.bool or np.int8 as the dtype; this will reduce your memory consumption by at least a factor of 4.

Here is a small demonstration:

In [34]: df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000)))

In [35]: df.shape
Out[35]: (60000, 10000)

In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 2.2 GB

By default, pandas uses np.int32 (32 bits or 4 bytes) for integers here (the default integer width is platform dependent; on Linux it would even be np.int64).

Let's downcast it to np.int8:

In [39]: df_int8 = df.astype(np.int8)

In [40]: df_int8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int8(10000)
memory usage: 572.2 MB

It now consumes 572 MB instead of 2.2 GB (four times less).

Or using np.bool:

In [41]: df_bool = df.astype(np.bool)

In [42]: df_bool.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: bool(10000)
memory usage: 572.2 MB
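
A follow-up sketch: casting with astype() still allocates the full int32 frame first, so when memory is tight it can help to pass the dtype at construction time so the 2.2 GB intermediate never exists:

import numpy as np
import pandas as pd

# Generate the 0/1 data as int8 from the start; no int32 temporary.
df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000), dtype=np.int8))
df.info()  # memory usage: 572.2 MB

And since the data is described as really sparse, a scipy.sparse matrix would reduce memory even further; most scikit-learn estimators accept sparse matrices directly.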

Upvotes: 1
