Cactus Philosopher

Reputation: 864

seaborn violin plot with frequency and values in separate columns

I have some DataFrame:

import pandas as pd
import numpy as np
import seaborn as sns

np.random.seed(1)
data = {'values': range(0,200,1), 'frequency': np.random.randint(low=0, high=2000, size=200)}
df = pd.DataFrame(data)

I am trying to create a violin plot where the y-axis corresponds to the values column and the width of the violin corresponds to the frequency column.

I can duplicate each row by the value in the frequency column and then call a violin plot:

repeat_df = df.loc[df['values'].repeat(df['frequency'])]
sns.violinplot(y=repeat_df['values'])


This works...except when the resulting duplicated DataFrame has 50+ million rows. What is a better solution when working with large DataFrames?

Upvotes: 0

Views: 471

Answers (1)

Christian Karcher

Reputation: 3721

As suggested in my comment:

Before repeating the rows, reduce the resolution of the frequencies to the percent level: normalize them by their maximum and round them to integers in the range 0 to 100.

This way you do not lose a significant amount of detail, but each row is repeated at most 100 times. Note that any value whose frequency is below 0.5% of the maximum rounds to zero and drops out of the plot.

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

np.random.seed(1)
n_values = 50000
# creating values with sinusoidal frequency modulation
data = {'values': range(0,n_values,1), 'frequency': np.random.randint(low=0, high=2000, size=n_values)*(np.sin(np.arange(n_values)/(n_values/50))+2)}

df = pd.DataFrame(data)

# old method: 100 million rows after repeat
# repeat() needs integer counts; the modulated frequencies are floats, so cast first
repeat_df = df.loc[df['values'].repeat(df['frequency'].astype(int))]
print(f"Old method: {len(repeat_df)} Observations")

# new method: renormalize and round frequency so each row repeats at most 100 times,
# resulting in <2 million rows after repeat
# cast to int because repeat() does not accept float counts
df['frequency'] = np.round(df['frequency'] / df['frequency'].max() * 100).astype(int)
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"New method: {len(repeat_df)} normalized Observations")

sns.violinplot(y=repeat_df['values'])
plt.show()


If your 50+ million rows stem from the values instead, I would rebin those values accordingly, e.g. to a set of 100 values.
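
The rebinning idea could be sketched roughly as follows. This is only an illustration under my own assumptions: the helper `rebin` and its parameters `n_bins` and `max_repeats` are hypothetical names, and equal-width bins are just one possible choice. It sums the frequencies per bin with a weighted `np.histogram`, then applies the same normalize-and-round step as above.

```python
import numpy as np
import pandas as pd


def rebin(df, n_bins=100, max_repeats=100):
    """Hypothetical helper: collapse `values` into equal-width bins,
    sum `frequency` per bin, and rescale the sums to 0..max_repeats."""
    values = df["values"].to_numpy(dtype=float)
    freq = df["frequency"].to_numpy(dtype=float)

    # Weighted histogram: each bin holds the total frequency of the values in it
    counts, edges = np.histogram(values, bins=n_bins, weights=freq)
    if counts.max() <= 0:
        raise ValueError("all frequencies are zero, nothing to plot")

    # Represent each bin by its center
    centers = (edges[:-1] + edges[1:]) / 2

    # Same normalize-and-round step as above, applied to the binned sums
    scaled = np.round(counts / counts.max() * max_repeats).astype(int)
    return pd.DataFrame({"values": centers, "frequency": scaled})


np.random.seed(1)
df = pd.DataFrame({
    "values": np.arange(50_000),
    "frequency": np.random.randint(low=0, high=2000, size=50_000),
})

binned = rebin(df)

# At most n_bins * max_repeats observations remain after repeating
repeat_values = np.repeat(binned["values"].to_numpy(),
                          binned["frequency"].to_numpy())
```

`repeat_values` can then be passed to `sns.violinplot(y=repeat_values)` as before. With 100 bins and at most 100 repetitions each, the repeated array never exceeds 10,000 entries, regardless of how many distinct values the original data has.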

Upvotes: 1
