Reputation: 65
Hello, I am trying to create a pandas DataFrame from (a list of dicts or a dict of dicts) that has an eventual shape of 60,000 rows and ~10,000 columns.
The column values are 0 or 1 and really sparse.
Creating the list/dict object is fast, but when I call from_dict or from_records I get memory errors. I also tried appending to a DataFrame periodically rather than all at once, and setting individual cells one by one, to no avail.
By the way, I am building my Python object from 100 JSON files that I parse.
How can I go from Python objects to DataFrames? Maybe I can also use something else. I eventually want to feed it to a scikit-learn algorithm.
Upvotes: 2
Views: 2840
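Since the end goal is a scikit-learn estimator and the raw data is a list of dicts, one route worth noting (a sketch, not taken from the answers below) is to skip the dense DataFrame entirely and build a scipy.sparse matrix, e.g. with scikit-learn's DictVectorizer; most estimators accept sparse input directly. The records here are hypothetical stand-ins for the parsed JSON.

# Sketch: list of dicts -> scipy.sparse CSR matrix, bypassing pandas
from sklearn.feature_extraction import DictVectorizer

rows = [{"feat_a": 1, "feat_c": 1},  # hypothetical parsed records;
        {"feat_b": 1}]               # absent keys are implicitly 0

vec = DictVectorizer(sparse=True)    # sparse=True is the default
X = vec.fit_transform(rows)          # CSR matrix, only non-zeros stored
print(X.shape, X.nnz)                # (2, 3) 3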
Reputation: 3189
Another thing you might try is to enable pyarrow:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
This sped up my calls to pd.DataFrame by an order of magnitude!
(Note that to use pyarrow, you must use pyspark>=3.0.0 if you use a newer pyarrow (e.g. pyarrow>=1.0.0). For pyspark==2.x, it's easiest to use pyarrow==0.15.x.)
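As a minimal sketch of that setting in context (the DataFrame below is a stand-in; note that on Spark 3 the key was renamed to spark.sql.execution.arrow.pyspark.enabled, though the old name still works with a deprecation warning):

# Hypothetical pyspark session; the speedup applies to toPandas() calls
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sdf = spark.range(1000000)   # stand-in Spark DataFrame
pdf = sdf.toPandas()         # Arrow-accelerated conversion to pandas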
Upvotes: 0
Reputation: 210822
If you have only 0 and 1 as values, you should use np.bool or np.int8 as the dtype - this will reduce your memory consumption by at least 4 times.
Here is a small demonstration:
In [34]: df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000)))
In [35]: df.shape
Out[35]: (60000, 10000)
In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 2.2 GB
By default pandas used np.int32 here (32 bits or 4 bytes) for integers; let's downcast it to np.int8:
In [39]: df_int8 = df.astype(np.int8)
In [40]: df_int8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int8(10000)
memory usage: 572.2 MB
It now consumes 572 MB instead of 2.2 GB (4 times less).
Or using np.bool:
In [41]: df_bool = df.astype(np.bool)
In [42]: df_bool.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: bool(10000)
memory usage: 572.2 MB
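Building on the demo above, and given how sparse the data is, a pandas sparse dtype may shrink things further and converts cleanly to scipy for scikit-learn. This is a sketch assuming pandas >= 0.25 (for the DataFrame.sparse accessor); the smaller shape keeps the demo light, and the random data is 50% dense, so real "really sparse" 0/1 data shrinks far more.

# Sketch: sparse int8 frame, then scipy.sparse for scikit-learn
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (60000, 100)), dtype=np.int8)
sparse_df = df.astype(pd.SparseDtype(np.int8, fill_value=0))
print(sparse_df.memory_usage(deep=True).sum())  # scales with the non-zeros
X = sparse_df.sparse.to_coo().tocsr()           # ready for sklearn estimators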
Upvotes: 1