Reputation: 453
I have a large pandas dataframe of shape (696, 20531) and I'd like to plot all of its values in a single histogram. Using df.plot(kind='hist') seems to take forever. Is there a better way to do this?
Upvotes: 8
Views: 5109
Reputation: 1413
Plotting large datasets with pandas is often troublesome because of the memory overhead (more on that here).
A memory-efficient way to do it is to use DuckDB. You can store your data in a .parquet file and then use SQL to compute the bins and heights for your histogram.
You can use the following snippet as a template (just replace SOME_COLUMN with your column name and 100.0 with your bin size):
select
    floor(SOME_COLUMN/100.0)*100.0,
    count(*) as count
from 'path/to/file.parquet'
group by 1
order by 1;
Then, you can pass the results to matplotlib's bar function, which takes bin positions and heights.
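To make the hand-off to bar concrete, here is a minimal sketch of the same floor-based binning done in NumPy instead of SQL. It is not the DuckDB approach itself, only an illustration of what the query computes. The names floor_bins and plot_bins and the sample values are hypothetical.

```python
import numpy as np


def floor_bins(values, bin_size):
    """Mirror of: floor(x / bin_size) * bin_size, count(*) ... group by 1 order by 1."""
    values = np.asarray(values, dtype=float)
    # Left edge of the bin each value falls into.
    starts = np.floor(values / bin_size) * bin_size
    # np.unique returns the sorted distinct edges and how often each occurs.
    bin_starts, counts = np.unique(starts, return_counts=True)
    return bin_starts, counts


def plot_bins(bin_starts, counts, bin_size):
    """Draw the precomputed bins; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # align="edge" places each bar so it starts at its bin's left edge.
    ax.bar(bin_starts, counts, width=bin_size, align="edge")
    return ax


# Hypothetical stand-in data, not taken from the question.
values = [5, 20, 99, 100, 150, 250, 260, 299]
bin_starts, counts = floor_bins(values, bin_size=100.0)
```

Calling plot_bins(bin_starts, counts, 100.0) would then draw three bars. With the SQL version, the two result columns play the role of bin_starts and counts.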
I implemented this in a new package called JupySQL. It essentially does what I've described, plus a couple of extra things. Here, you can see an example and some memory benchmarks demonstrating that this approach is much more efficient.
Upvotes: 0
Reputation: 8118
Another approach is to use DataFrame.sample(), which returns a random sample of size n from your dataframe (seeded with random_state). You can then plot a sample of the data (e.g. 1000 points, with repeatable randomness), e.g.:
df.sample(n=1000,random_state=1).plot()
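One caveat when applying this to the frame in the question: DataFrame.sample() draws rows, so n cannot exceed the 696 rows unless replace=True is passed. A possible variant, sketched below under that assumption, flattens the frame first and samples individual values. The stand-in frame uses fewer columns than the question's, and the bin count of 20 is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (same row count, far fewer columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((696, 50)))

# Flatten to one Series of all values, then sample individual values.
flat = df.stack()
sample = flat.sample(n=1000, random_state=1)  # repeatable because of the seed

# Bin the sampled values; counts and edges can be plotted afterwards.
counts, edges = np.histogram(sample, bins=20)
```

For a quick look, sample.hist(bins=20) should draw the same bins directly.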
Upvotes: 1
Reputation: 12610
Use DataFrame.stack():
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 10))
print(df.to_string())
          0         1         2         3         4         5         6         7         8         9
0 -0.760559  0.317021  0.325524 -0.300139  0.800688  0.221835 -1.258592  0.333504  0.669925  1.413210
1  0.082853  0.041539  0.255321 -0.112667 -1.224011 -0.361301 -0.177064  0.880430  0.188540 -0.318600
2 -0.827121  0.261817  0.817216 -1.330318 -2.254830  0.447037  0.294458  0.672659 -1.242452  0.071862
3  1.173998  0.032700 -0.165357  0.572287  0.288606  0.261885 -0.699968 -2.864314 -0.616054  0.798000
4  2.134925  0.966877 -1.204055  0.547440  0.164349  0.704485  1.450768 -0.842088  0.195857 -0.448882
df.stack().hist()
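For a frame as large as the one in the question, a related option is to flatten the values with NumPy and bin them once. This is a sketch rather than a benchmarked recommendation. It skips the MultiIndex that stack() builds, which may help at that size. The stand-in frame, the bin count of 50, and the helper name plot_histogram are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's frame (same row count, far fewer columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((696, 200)))

# One flat array of every value; no per-column handling.
values = df.to_numpy().ravel()

# Bin once, up front; only len(counts) bars need to be drawn afterwards.
counts, edges = np.histogram(values, bins=50)


def plot_histogram(counts, edges):
    """Draw precomputed bins; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Each bar starts at its left edge and spans the bin width.
    ax.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
    return ax
```

Calling plot_histogram(counts, edges) draws the figure. If the data contains NaN, np.histogram will likely fail when inferring the range, so drop those values first (for example with values[~np.isnan(values)]).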
Upvotes: 4