sparrow

Reputation: 460

How to create an interactive graph on a large data set?

I am trying to create an interactive graph using HoloViews on a large data set. Below is a sample of the data file, called trackData.csv:

Event         Time             ID     Venue    
Javeline      11:25:21:012345  JVL    Dome
Shot pot      11:25:22:778929  SPT    Dome
4x4           11:25:21:993831  FOR    Track
4x4           11:25:22:874293  FOR    Track
Shot pot      11:25:21:087822  SPT    Dome
Javeline      11:25:23:878792  JVL    Dome
Long Jump     11:25:21:892902  LJP    Aquatic
Long Jump     11:25:22:799422  LJP    Aquatic

This is how I read the data and plot a scatter plot.

import pandas as pd
import holoviews as hv

trackData = pd.read_csv('trackData.csv')
scatter = hv.Scatter(trackData, 'Time', 'ID')
scatter

Because this data set is quite large, zooming in and out of the scatter plot is very slow, and I would like to speed it up. I read that HoloViews' decimate is recommended for large datasets, but I don't know how to use it with the code above; most of what I tried throws an error. Also, is there a way to make sure the Time column is converted to microseconds? Thanks in advance for the help.

Upvotes: 2

Views: 876

Answers (2)

James A. Bednar

Reputation: 3255

Datashader indeed does not handle categorical axes as used here, but that's not so much a limitation of the software as of my imagination: what should it be doing with them? A Datashader scatterplot (Canvas.points) is meant for a very large number of points located on a continuously indexed 2D plane. Such a plot approximates a 2D probability distribution function, accumulating points per pixel to show the density in that region, and revealing spatial patterns across pixels.

A categorical axis doesn't have the same properties that a continuous numerical axis does, because there's no spatial relationship between adjacent values. Specifically in this case, there's no apparent meaning to an ordering of the ID field (it appears to be a letter code for a sporting event type), so I can't see any meaning to accumulating across ID values per pixel the way Datashader is designed to do. Even if you convert IDs to numbers, you'll either just get random-looking noise (if there are more ID values than vertical pixels), or a series of spotty lines (if there are fewer ID values than pixels).

Here, maybe there are only a few dozen or so unique ID values, but many, many time measurements? In that case most people would use a box, violin, histogram, or ridge plot per ID, to see the distribution of values for each ID value. A Datashader points plot is a 2D histogram, but if one axis is categorical you're really dealing with a set of 1D histograms, not a single combined 2D histogram, so just use histograms if that's what you're after.
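To make the "set of 1D histograms" idea concrete, here is a minimal sketch that counts events per ID in one-second bins using only the standard library. The `rows` list is the sample from the question; the one-second bin width and the names `second_bin` and `histograms` are assumptions chosen for illustration, and on a real data set you would pick a bin width that suits your time range.

```python
from collections import Counter

# (ID, Time) pairs taken from the sample data in the question
rows = [
    ("JVL", "11:25:21:012345"),
    ("SPT", "11:25:22:778929"),
    ("FOR", "11:25:21:993831"),
    ("FOR", "11:25:22:874293"),
    ("SPT", "11:25:21:087822"),
    ("JVL", "11:25:23:878792"),
    ("LJP", "11:25:21:892902"),
    ("LJP", "11:25:22:799422"),
]

def second_bin(time_str):
    """Drop the microsecond field so each time falls into a one-second bin."""
    return time_str.rsplit(":", 1)[0]

# Count how many events each ID has in each one-second bin
counts = Counter((id_, second_bin(t)) for id_, t in rows)

# Shared, sorted bin edges so the per-ID histograms are comparable
bins = sorted({b for _, b in counts})
ids = sorted({i for i, _ in counts})

# One 1D histogram (list of counts per bin) for each ID
histograms = {id_: [counts[(id_, b)] for b in bins] for id_ in ids}

for id_, hist in histograms.items():
    print(id_, hist)
```

Each entry of `histograms` is an independent 1D histogram over the same bins, which is the structure a per-ID histogram, box, or ridge plot would display.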

If you really do want to try plotting all the points per ID as raw points, you could do that using vertical spike events as in https://examples.pyviz.org/iex_trading/IEX_stocks.html . You can also add some vertical jitter and then use Datashader, but that's not something directly supported right now, and it doesn't have the clear mathematical interpretation that a normal Datashader plot does (in terms of approximating a density function).
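The vertical-jitter idea can be sketched as follows, assuming each ID has first been mapped to an integer code. The function name `jitter_codes`, the jitter width of 0.3, and the fixed seed are all hypothetical choices for illustration; the only requirement is that the width stays below 0.5 so neighbouring IDs do not overlap.

```python
import random

def jitter_codes(codes, width=0.3, seed=0):
    """Add uniform noise in [-width, width] to each integer code."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [c + rng.uniform(-width, width) for c in codes]

# ID column from the sample data in the question
ids = ["JVL", "SPT", "FOR", "FOR", "SPT", "JVL", "LJP", "LJP"]

# Map each distinct ID to an integer code (alphabetical order)
code_of = {name: i for i, name in enumerate(sorted(set(ids)))}
codes = [code_of[name] for name in ids]

# Jittered y positions: each point stays within its own ID's band
jittered = jitter_codes(codes)
```

The jittered values would then serve as the y coordinate in place of the raw codes. As noted above, the resulting picture has no clean density interpretation; the jitter only spreads overlapping points apart visually.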


Upvotes: 2

Sander van den Oord

Reputation: 12808

The disadvantage of decimate() is that it downsamples your data points.
I think you need datashade() here, but Datashader doesn't like that ID is a categorical variable rather than a numerical value.

So a solution could be to convert your categorical variable to a numerical code.

See the code example below for both hvPlot (which I prefer) and HoloViews:

import io
import pandas as pd
import hvplot.pandas
import holoviews as hv
# dynspread is for making point sizes larger when using datashade
from holoviews.operation.datashader import datashade, dynspread

# sample data
text = """
    Event         Time             ID     Venue    
    Javeline      11:25:21:012345  JVL    Dome
    Shot pot      11:25:22:778929  SPT    Dome
    4x4           11:25:21:993831  FOR    Track
    4x4           11:25:22:874293  FOR    Track
    Shot pot      11:25:21:087822  SPT    Dome
    Javeline      11:25:23:878792  JVL    Dome
    Long Jump     11:25:21:892902  LJP    Aquatic
    Long Jump     11:25:22:799422  LJP    Aquatic
"""

# create dataframe and parse time
df = pd.read_csv(io.StringIO(text), sep=r'\s{2,}', engine='python')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
df = df.set_index('Time').sort_index()

# add a column that converts the categorical IDs to numerical codes
df['ID'] = pd.Categorical(df['ID'])
df['ID_code'] = df['ID'].cat.codes

# use this to overwrite numerical yticks with categorical yticks
yticks=[(0, 'FOR'), (1, 'JVL'), (2, 'LJP'), (3, 'SPT')]

# this is the hvplot solution: set datashade=True
df.hvplot.scatter(
    x='Time', 
    y='ID_code', 
    datashade=True,
    dynspread=True,
    padding=0.05, 
).opts(yticks=yticks)

# this is the holoviews solution
scatter = hv.Scatter(df, kdims=['Time'], vdims=['ID_code'])
dynspread(datashade(scatter)).opts(yticks=yticks, padding=0.05)


More info on datashader and decimate:
http://holoviews.org/user_guide/Large_Data.html
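On the question of converting the Time column to microseconds: a minimal sketch, assuming every value has the form HH:MM:SS:ffffff as in the sample data. The helper name `time_to_micros` is hypothetical, and the result is the number of microseconds since midnight as a plain integer.

```python
def time_to_micros(time_str):
    """Convert 'HH:MM:SS:ffffff' to integer microseconds since midnight."""
    hours, minutes, seconds, fraction = time_str.split(":")
    # Right-pad the fraction so a shorter field such as '5' means 0.5 s
    micros = int(fraction.ljust(6, "0")[:6])
    total_seconds = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1_000_000 + micros

# Times from the first two rows of the sample data
first = time_to_micros("11:25:21:012345")
second = time_to_micros("11:25:22:778929")
print(first, second, second - first)
```

A purely numeric time axis like this is continuous, so it can be passed to plotting code that expects numbers rather than datetimes.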

Resulting plot: [image: using datashader for large data]

Upvotes: 2
