Reputation: 864
I have some DataFrame:
import pandas as pd
import numpy as np
import seaborn as sns
np.random.seed(1)
data = {'values': range(0,200,1), 'frequency': np.random.randint(low=0, high=2000, size=200)}
df = pd.DataFrame(data)
I am trying to create a violin plot where the y-axis corresponds to the values column and the width of the violin corresponds to the frequency column.
I can duplicate each row by the value in the frequency column and then call a violin plot:
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
sns.violinplot(y=repeat_df['values'])
This works...except when the resulting duplicated DataFrame has 50+ million rows. What is a better solution when working with large DataFrames?
Upvotes: 0
Views: 471
Reputation: 3721
As suggested in my comment:
Before repeating the rows, reduce the resolution of the frequencies to a percent level by normalizing them and rounding to integers in the range 0 to 100.
This way, you do not lose a significant amount of detail, but each row is repeated at most 100 times.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
np.random.seed(1)
n_values = 50000
# creating values with sinusoidal frequency modulation
modulation = np.sin(np.arange(n_values) / (n_values / 50)) + 2
# cast to int: Series.repeat() needs integer repeat counts, not floats
frequency = (np.random.randint(low=0, high=2000, size=n_values) * modulation).astype(int)
df = pd.DataFrame({'values': range(n_values), 'frequency': frequency})
# old method: about 100 million rows after repeat
# (the sum of the frequencies equals the row count, so the huge frame is not built here)
print(f"Old method: {df['frequency'].sum()} Observations")
# new method: renormalize and round frequency to reduce repetitions to at most 100 per row,
# resulting in <2 million rows after repeat
df['frequency'] = np.round(df['frequency'] / df['frequency'].max() * 100).astype(int)
repeat_df = df.loc[df['values'].repeat(df['frequency'])]
print(f"New method: {len(repeat_df)} normalized Observations")
sns.violinplot(y=repeat_df['values'])
plt.show()
If your 50+ million rows stem from the values instead, I would rebin those values accordingly, e.g. to a set of 100 values.
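A minimal sketch of that rebinning, assuming equal-width bins are acceptable and that each bin can be represented by its center. The helper name rebin_values and the demo frame are made up for illustration; only np.histogram and the pandas calls are real API.

```python
import numpy as np
import pandas as pd

def rebin_values(df, n_bins=100, value_col="values", freq_col="frequency"):
    """Collapse the value column into n_bins equal-width bins.

    Hypothetical helper: frequencies of all values falling into the same
    bin are summed, and each bin is represented by its center.
    """
    counts, edges = np.histogram(
        df[value_col].to_numpy(),
        bins=n_bins,
        weights=df[freq_col].to_numpy(),  # weighted histogram = summed frequencies
    )
    centers = (edges[:-1] + edges[1:]) / 2
    return pd.DataFrame({value_col: centers, freq_col: counts})

# Demo data (assumed shape: one row per distinct value, with a frequency)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "values": np.arange(50_000),
    "frequency": rng.integers(0, 2000, size=50_000),
})

rebinned = rebin_values(demo, n_bins=100)
print(len(demo), "rows before,", len(rebinned), "rows after rebinning")
```

The rebinned frame has only n_bins rows, so its frequency column can then be normalized to 0..100 and repeated as in the code above before calling sns.violinplot.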
Upvotes: 1