dodoelhos

Reputation: 217

Make Plotly scatter plots faster for large datasets - Python

I have a dataset that is about 300,000 rows. Here is a snippet of the dataset.

    id                datetime            results
0   30  2020-09-29 14:55:21+00             0.0424
1   30  2020-09-29 14:55:23+00             0.0424
2   31  2020-09-29 14:55:24+00             0.0424
3   31  2020-09-29 14:55:25+00             0.0424
4   32  2020-09-29 14:55:26+00             0.0424
5   32  2020-09-29 14:55:27+00             0.0424

I tried to use matplotlib for a scatter plot, but it was really slow. I then moved on to Plotly, as I have seen that Scattergl creates interactive graphs quickly, which is exactly what I need. However, when I plot anything above 100,000 points, the graph is really slow and takes a lot of time to render.

Here is the code I implemented:

import plotly.graph_objects as go

# readings is a pandas dataframe containing the data

def plot_scatter(df, x_column, y_column):
    fig = go.Figure(data=go.Scattergl(x=df[x_column], y=df[y_column], mode='markers'))
    fig.show()

plot_scatter(readings, 'datetime', 'results')

I also tried to split the plotted points by id (so that each set of points with a given id has its own color and the id shows in the legend), but none of the methods I tried worked.

I would really appreciate some help on how to make a fast scatter plot (maybe there is something better than Scattergl) and how to split the data on the graph by id.

Upvotes: 11

Views: 11221

Answers (1)

Jonvdrdo

Reputation: 411

Your problem arises from a design choice found in most visualization libraries. Namely, Plotly's Python bindings send all the data to the front-end, which makes the figure slow both to render and to interact with. Plotly-Resampler tackles this by performing dynamic aggregation (i.e. showing only a limited number of points on the graph) and rendering new points after a user-graph interaction. See the GIF below for example usage:

[GIF: Plotly-Resampler example usage]
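To make the idea of aggregation concrete, here is a minimal pure-Python sketch that keeps only the minimum and maximum of each bin, so peaks survive the reduction. It is an illustration only and not Plotly-Resampler's actual algorithm. The function name `minmax_downsample` and the `n_bins` parameter are made up for this example.

```python
import math


def minmax_downsample(xs, ys, n_bins):
    """Keep the min and max y of each of n_bins equally sized index bins.

    Illustrative only: this is NOT the algorithm used by Plotly-Resampler.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    n = len(ys)
    if n <= 2 * n_bins:
        return list(xs), list(ys)  # already small enough, nothing to drop

    out_x, out_y = [], []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        idx = range(start, stop)
        lo = min(idx, key=lambda i: ys[i])  # index of the bin minimum
        hi = max(idx, key=lambda i: ys[i])  # index of the bin maximum
        for i in sorted({lo, hi}):  # keep the original left-to-right order
            out_x.append(xs[i])
            out_y.append(ys[i])
    return out_x, out_y


# Same signal shape as the example data in this answer, with integer x values.
n = 300_000
xs = list(range(n))
ys = [math.sin(i / 50) for i in xs]

small_x, small_y = minmax_downsample(xs, ys, n_bins=500)
print(len(xs), "->", len(small_x), "points")
```

Plotting `small_x` and `small_y` sends at most two points per bin to the front-end. What a static reduction like this cannot do is fetch the dropped points again when you zoom in, which is the part the library handles.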


Mapping this to your code issue:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

n = 300_000  # nbr of datapoints to plot

df = pd.DataFrame(
    data=np.sin(np.arange(n) / 50),
    index=pd.date_range(
       '2022/04/04 10:41', freq='s', periods=n
    ).rename('timestamp'),
    columns=['result']
).reset_index()
df
             timestamp    result
0  2022-04-04 10:41:00  0.000000
1  2022-04-04 10:41:01  0.019999
2  2022-04-04 10:41:02  0.039989
3  2022-04-04 10:41:03  0.059964
4  2022-04-04 10:41:04  0.079915
...                ...       ...

The visualization code:

# 1. Wrap the Plotly-Figure with the FigureResampler
fig = FigureResampler(go.Figure())
for c in set(df.columns).difference({'timestamp'}):
    fig.add_trace(
        go.Scattergl(name=c, mode='lines+markers', showlegend=True),
        # OPTIONAL: for faster graph construction 
        #    -> add the trace data by using hf_x and hf_y
        hf_x=df['timestamp'], 
        hf_y=df[c]
    )

# 2. Instead of fig.show() call fig.show_dash()
fig.show_dash(mode='inline')
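The question also asks how to give each id its own color and legend entry. A common approach is to add one trace per id, since Plotly assigns each trace its own default color and legend item. Below is a minimal sketch of the grouping step, using the rows from the question's snippet. The helper name `traces_by_id` is made up for this example, and the traces are plain dicts rather than `go.Scattergl` objects so that the block runs without Plotly installed.

```python
from collections import defaultdict


def traces_by_id(ids, xs, ys):
    """Build one 'scattergl' trace spec (a plain dict) per distinct id."""
    groups = defaultdict(lambda: {"x": [], "y": []})
    for id_, x, y in zip(ids, xs, ys):
        groups[id_]["x"].append(x)
        groups[id_]["y"].append(y)
    return [
        {
            "type": "scattergl",
            "mode": "markers",
            "name": str(id_),  # the name is what appears in the legend
            "x": points["x"],
            "y": points["y"],
        }
        for id_, points in groups.items()
    ]


# Rows from the question's snippet.
ids = [30, 30, 31, 31, 32, 32]
datetimes = [
    "2020-09-29 14:55:21", "2020-09-29 14:55:23", "2020-09-29 14:55:24",
    "2020-09-29 14:55:25", "2020-09-29 14:55:26", "2020-09-29 14:55:27",
]
results = [0.0424] * 6

traces = traces_by_id(ids, datetimes, results)
print([t["name"] for t in traces])

# With Plotly installed, the specs can be passed straight to a figure:
#   import plotly.graph_objects as go
#   go.Figure(data=traces).show()
```

With a DataFrame you can pass the columns directly, e.g. `traces_by_id(readings['id'], readings['datetime'], readings['results'])`. For 300,000 rows, `df.groupby('id')` does the same grouping more efficiently, and the same per-id loop should also work with `fig.add_trace(..., hf_x=..., hf_y=...)` on the `FigureResampler` above.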

Hope this helps you further!
(disclaimer - I am the main developer of Plotly-Resampler)

Upvotes: 25
