Boxplot and Scatterplot python

Question

I have time series data from which I would like to build an overlaid scatterplot and boxplot. The data looks like this:

    TokenUsed   date
0   8   2020-01-05
1   8   2020-01-05
2   8   2020-01-05
3   8   2020-01-05
4   8   2020-01-05
... ... ...
51040   7   2020-02-23
51041   7   2020-02-23
51042   7   2020-02-23
51043   7   2020-02-23
51044   7   2020-02-23

This time series can be neatly shown as a boxplot (I had trouble with the x-axis being a date, but solved it by converting the dates to strings). Now I would like to show only the data points whose value is above a threshold (> 81 in my case). The code and the resulting image are below:

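import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(12, 6))
# whis=[0, 100] extends the whiskers to the min and max of each group
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0, 100])
ax.axhline(81)  # threshold line
plt.locator_params(axis='x', nbins=10)
plt.show()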

When I add a scatter plot, I get image (2), and by keeping only the values > 81 I get image (3). What I don't understand is why the x-axes of the two plots don't match!

Code:

fig, ax = plt.subplots(figsize=(12, 6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0, 100])
# Without filter (image 2)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df, ax=ax, color=".25")
# With filter (image 3)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df[df["TokenUsed"] > 81], ax=ax, color=".25")
ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()

Tom · Accepted Answer

Answer:

Try editing your filtering such that no rows of df are actually removed. That is, apply a mask specifically on the TokenUsed column, such that values are replaced with NaN (rather than the whole row being removed). Here's how I would implement this:

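# make a copy of df and use that for the scatterplot
df2 = df.copy()
# values of 81 or below become NaN, mirroring the > 81 filter without dropping rows
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax=ax, color=".25")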

Explanation

Caveat: this is my understanding of what is going on, based on my own observations; I have not checked the actual implementation behind the scenes.

Seaborn is less aware of the dates than you are expecting. When creating the boxplot with the date column on the x-axis, seaborn groups the data by each unique value in that column. It then assigns an integer position to each of these strings (starting from 0). The y-data are plotted against these integer values, and the x-tick labels are replaced with the corresponding string values. So in your case, there are 8 unique date strings, and they are plotted at x-positions 0 to 7. It also doesn't matter that they look like dates. You could add more string values to the date column; as far as I can tell, their positions would follow the order in which the strings first appear in the data rather than any calendar order (e.g. I would guess a string like '999' appended after your existing rows would simply get the next position, 8).

The filter df[df["TokenUsed"]>81] removes every row where the TokenUsed value is 81 or below. This means the filtered DataFrame no longer contains all of the date strings that are in the original data, which is what creates the unexpected result when plotting. In your filtered data, the first date with values above 81 is 2020-02-09. So in the scatterplot call, those values get plotted at x=0, which is confusing because the values from 2020-01-05 were plotted at x=0 in the call to boxplot.
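To make the mismatch concrete, here is a minimal sketch that uses only pandas. The sample data and the helper `category_positions` are hypothetical; the helper assumes positions are handed out in order of first appearance, which is my reading of the behaviour rather than seaborn's actual code. The sample dates are already sorted, so the result would be the same under alphabetical ordering.

```python
import pandas as pd

# Hypothetical sample data: three weekly dates stored as strings
df = pd.DataFrame({
    "date": ["2020-01-05"] * 3 + ["2020-01-12"] * 3 + ["2020-01-19"] * 3,
    "TokenUsed": [8, 10, 12, 20, 30, 40, 70, 90, 100],
})


def category_positions(labels):
    """Hypothetical helper: map each distinct label to an integer position,
    in order of first appearance (an assumption about the plotting behaviour)."""
    return {label: pos for pos, label in enumerate(labels.unique())}


# Positions the boxplot would use (all dates present)
full_positions = category_positions(df["date"])

# Positions a scatterplot of the filtered frame would use (dates are missing)
filtered_positions = category_positions(df.loc[df["TokenUsed"] > 81, "date"])

print(full_positions)
print(filtered_positions)
```

In this toy data the last date sits at position 2 in the full frame but at position 0 in the filtered frame, which is the same kind of shift seen in the question's image (3).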

The fix is to make sure all the original dates are still present in the filtered data, but to replace the filtered out values with NaN so nothing gets plotted.
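As a quick data-level check of that idea, the sketch below (hypothetical sample data, pandas only) compares the row filter with `Series.mask`. It uses `<= 81` as the mask condition so that it mirrors the question's `> 81` filter exactly; that choice is mine.

```python
import pandas as pd

# Hypothetical sample data: three weekly dates stored as strings
df = pd.DataFrame({
    "date": ["2020-01-05"] * 3 + ["2020-01-12"] * 3 + ["2020-01-19"] * 3,
    "TokenUsed": [8, 10, 12, 20, 30, 40, 70, 90, 100],
})

# Row filter: drops rows, so some dates disappear entirely
filtered = df[df["TokenUsed"] > 81]

# Mask: keeps every row, replacing values of 81 or below with NaN
masked = df.copy()
masked["TokenUsed"] = masked["TokenUsed"].mask(masked["TokenUsed"] <= 81)

print("dates after filter:", list(filtered["date"].unique()))
print("dates after mask:  ", list(masked["date"].unique()))
print("values left to plot:", int(masked["TokenUsed"].notna().sum()))
```

The masked frame has the same rows and the same set of dates as the original, and the only non-NaN values are the ones above the threshold, so the scatterplot has nothing to draw for the other rows.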

Here is the example I used to test this:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# fake data, only one date has values over 80
dr = ['01-05-2020'] * 100 + ['01-12-2020'] * 100 + ['01-19-2020'] * 100
data = list(np.random.randint(0,80,200)) + list(np.random.randint(50,150,100))
df = pd.DataFrame({'date':dr, 'TokenUsed':data})

fig, ax = plt.subplots(figsize = (12,6))
ax = sns.boxplot(x="date", y="TokenUsed", data=df, ax=ax, whis=[0,100])

# the fix: keep every row (and so every date), but set values of 81 or below to NaN
df2 = df.copy()
df2['TokenUsed'] = df2['TokenUsed'].mask(df2['TokenUsed'] <= 81)
ax = sns.scatterplot(x="date", y="TokenUsed", data=df2, ax=ax, color=".25")

ax.axhline(81)
plt.locator_params(axis='x', nbins=10)
plt.show()

If I use the same filter that you applied, I get the same issue.
