tnf

Reputation: 303

Plotting large datasets as kind=bar ineffective

I am working with a semi-large data set of approximately 100,000 records. When I plot a df column as a line with the code below, the plot takes about 2 seconds.

import matplotlib.pyplot as plt

with plt.style.context('ggplot'):
    plt.figure(3, figsize=(16, 12))
    plt.subplot(411)
    df_pca_std['PC1_resid'].plot(title="PC1 Residual", color='r')

    # If I change the plot to a bar (no other change):
    df_X_std['PC1_resid'].plot(kind='bar', title="PC1 Residual", color='r')

it takes 112 seconds and the render changes like this (jumbled x axis):

[Two screenshots of the rendered plots; images not available]

I have suppressed the axis and changed the style, but neither helped. Does anyone have ideas for rendering this better and in less time? The data being plotted is being checked for mean reversion and is better displayed as a bar plot.

Upvotes: 0

Views: 168

Answers (2)

tnf

Reputation: 303

One possible solution: I do not actually need to plot bars. I can use the very fast line plot and matplotlib's fill_between function to color the area between zero and the line. The effect is similar to plotting all the bars, in a fraction of the time.

Use the to_pydatetime method of DatetimeIndex to convert Date (the DataFrame index) to an array of datetime.datetime objects that matplotlib can use, convert those to matplotlib date numbers with date2num, then fill the plot.

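import matplotlib.dates as mdates  # date2num lives in matplotlib.dates

# Convert the DatetimeIndex to matplotlib date numbers (same frame as the y values)
plotDates = mdates.date2num(df_pca_std.index.to_pydatetime())

plt.fill_between(plotDates, 0, df_pca_std['PC1_resid'], alpha=0.5)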
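
For anyone who wants to try this without the original data, here is a minimal, self-contained sketch of the same fill_between approach. The synthetic series, the one-minute frequency, the Agg backend, and the output filename pc1_resid_fill.png are all assumptions made for illustration; only the column name PC1_resid and the ggplot style come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df_pca_std['PC1_resid'] (assumed: 100,000 rows, one per minute)
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=100_000, freq="min")
resid = pd.Series(rng.standard_normal(len(index)), index=index, name="PC1_resid")

# Convert the DatetimeIndex to matplotlib's float date numbers
plot_dates = mdates.date2num(index.to_pydatetime())

with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(16, 4))
    # Fill from zero to the series, which reads like a dense bar plot
    filled = ax.fill_between(plot_dates, 0, resid.to_numpy(), color="r", alpha=0.5)
    ax.xaxis_date()  # interpret the x values as dates when drawing ticks
    ax.set_title("PC1 Residual")
    fig.savefig("pc1_resid_fill.png")  # hypothetical output filename
```

The fill is drawn as a single polygon collection rather than one rectangle per record, which is probably why it stays close to line-plot speed.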

Upvotes: 0

tnf

Reputation: 303

Not the best-looking charts, but at least pygal renders them. It plotted 2.1 million bars in 14.2 seconds.

import pygal
bar_chart = pygal.Bar()
bar_chart.add('PC1_residuals', df_X_std['PC1_resid'])
bar_chart.render_to_file('bar_chart.svg')

Upvotes: 1
