Reputation: 217
I have a dataset of about 300,000 rows. Here is a snippet of the dataset.
id datetime results
0 30 2020-09-29 14:55:21+00 0.0424
1 30 2020-09-29 14:55:23+00 0.0424
2 31 2020-09-29 14:55:24+00 0.0424
3 31 2020-09-29 14:55:25+00 0.0424
4 32 2020-09-29 14:55:26+00 0.0424
5 32 2020-09-29 14:55:27+00 0.0424
I tried to use matplotlib for a scatter plot, but it was really slow. I then moved on to Plotly, as I have seen that Scattergl creates interactive graphs quickly, which is exactly what I need. However, when I plot anything above 100,000 points, the graph is really slow and takes a long time to render.
Here is the code I implemented:
import plotly.graph_objects as go

# readings is a pandas dataframe containing the data
def plot_scatter(df, x_column, y_column):
    fig = go.Figure(data=go.Scattergl(x=df[x_column], y=df[y_column], mode='markers'))
    fig.show()

plot_scatter(readings, 'datetime', 'results')
I also tried to split the plotted points by id (so that each set of points with a given id gets its own color and the id shows up in the legend). I tried several methods, with little luck.
I would really appreciate some help on how to make a fast scatter plot (maybe there is something better than Scattergl) and how to split the data on the graph by id.
Upvotes: 11
Views: 11221
Reputation: 411
Your problem arises from a design choice common to most visualization libraries. Plotly's Python bindings send all the data to the front end, which makes the graph slow both to render and to interact with. Plotly-Resampler tackles this by performing dynamic aggregation (i.e. showing only a limited number of points on the graph) and rendering new points after a user interaction with the graph, such as zooming or panning.
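To make the aggregation idea a bit more concrete, here is a minimal pure-Python sketch of one simple way to reduce a long series to a bounded number of points: keep the minimum and maximum of each bucket. This is an illustration only and is not claimed to be the algorithm Plotly-Resampler uses internally; the function name `minmax_downsample` and the parameter `n_buckets` are made up for this example.

```python
from typing import List, Sequence, Tuple


def minmax_downsample(
    xs: Sequence, ys: Sequence[float], n_buckets: int
) -> Tuple[List, List[float]]:
    """Keep the min and max y-value of each bucket (hypothetical helper).

    Illustrative only: a simple stand-in for "show a limited number of
    points", not the aggregation Plotly-Resampler actually performs.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")

    n = len(ys)
    # Nothing to gain when the series is already small enough.
    if n <= 2 * n_buckets:
        return list(xs), list(ys)

    keep: List[int] = []
    for b in range(n_buckets):
        # Integer arithmetic gives contiguous, non-empty buckets.
        start = b * n // n_buckets
        stop = (b + 1) * n // n_buckets
        bucket = range(start, stop)
        i_min = min(bucket, key=lambda i: ys[i])
        i_max = max(bucket, key=lambda i: ys[i])
        # A set drops the duplicate when min and max coincide;
        # sorting keeps the points in their original x order.
        keep.extend(sorted({i_min, i_max}))

    return [xs[i] for i in keep], [ys[i] for i in keep]
```

At most `2 * n_buckets` points are returned, and extreme values inside each bucket survive, so spikes stay visible. A resampling front end would rerun a reduction like this on the visible range after each zoom or pan.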
Mapping this to your code issue:
import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly_resampler import FigureResampler

n = 300_000  # number of data points to plot
df = pd.DataFrame(
    data=np.sin(np.arange(n) / 50),
    index=pd.date_range(
        '2022/04/04 10:41', freq='s', periods=n
    ).rename('timestamp'),
    columns=['result']
).reset_index()
df
timestamp result
0 2022-04-04 10:41:00 0.000000
1 2022-04-04 10:41:01 0.019999
2 2022-04-04 10:41:02 0.039989
3 2022-04-04 10:41:03 0.059964
4 2022-04-04 10:41:04 0.079915
... ... ...
The visualization code:
# 1. Wrap the Plotly-Figure with the FigureResampler
fig = FigureResampler(go.Figure())
for c in set(df.columns).difference({'timestamp'}):
    fig.add_trace(
        go.Scattergl(name=c, mode='lines+markers', showlegend=True),
        # OPTIONAL: for faster graph construction
        # -> add the trace data by using hf_x and hf_y
        hf_x=df['timestamp'],
        hf_y=df[c]
    )
# 2. Instead of fig.show() call fig.show_dash()
fig.show_dash(mode='inline')
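For the second part of the question (one color and one legend entry per id), a common approach is to add one trace per id, since Plotly gives each trace its own color and legend entry by default. The sketch below assumes the column names from the question (`id`, `datetime`, `results`). The helper names `scattergl_traces_by_id` and `show_by_id` are hypothetical, and the grouping is done in plain Python only to keep the example self-contained.

```python
from collections import defaultdict
from typing import Iterable, List


def scattergl_traces_by_id(ids: Iterable, xs: Iterable, ys: Iterable) -> List[dict]:
    """Build one 'scattergl' trace spec per distinct id (hypothetical helper)."""
    groups = defaultdict(lambda: {"x": [], "y": []})
    for id_value, x, y in zip(ids, xs, ys):
        groups[id_value]["x"].append(x)
        groups[id_value]["y"].append(y)

    # One dict per id; the trace name is what appears in the legend.
    return [
        {
            "type": "scattergl",
            "mode": "markers",
            "name": str(id_value),
            "x": groups[id_value]["x"],
            "y": groups[id_value]["y"],
        }
        for id_value in sorted(groups)
    ]


def show_by_id(df, x_column, y_column, id_column="id"):
    """Plot a DataFrame with one colored trace per id (assumes plotly is installed)."""
    import plotly.graph_objects as go  # imported here so the helper above stays dependency-free

    traces = scattergl_traces_by_id(df[id_column], df[x_column], df[y_column])
    fig = go.Figure(data=traces)  # go.Figure accepts trace dicts that carry a "type" key
    fig.update_layout(showlegend=True)
    fig.show()
```

With the question's data this would be called as `show_by_id(readings, 'datetime', 'results')`. Splitting into traces does not reduce the number of points sent to the browser, so for 300,000 rows it probably still needs to be combined with some form of downsampling.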
Hope this helps you further!
(disclaimer - I am the main developer of Plotly-Resampler)
Upvotes: 25