Reputation: 1843
I am trying to plot a dataset using stripplot. Here is the head (there are 25 columns):
   Labels  Acidobacteria  Actinobacteria  Armatimonadetes  Bacteroidetes
0       0              0             495              NaN          27859
1       1              0            1256              NaN          46582
2       0              0            1081              NaN          23798
3       1              0            2523              NaN          35088
4       0              0            1383              NaN          19338
I have this dataset stored in a pandas DataFrame and can plot it using:
import matplotlib.pyplot as plt
import seaborn as sns

def plot():
    ax = sns.stripplot(data=df)
    ax.set(xlabel='Bacteria', ylabel='Abundance')
    plt.setp(ax.get_xticklabels(), rotation=45)
    plt.show()
To produce this plot.
I would like to set the hues to reflect the 'Labels' column. When I try:
sns.stripplot(x=df.columns.values.tolist(),y=df,data=df,hue='Labels')
I get:
ValueError: cannot copy sequence with size 26 to array axis with dimension 830
Upvotes: 2
Views: 3728
Reputation: 2274
I would like to expand on your answer (actually I will compact it) because this could be done in a "one-liner":
# To select specific columns:
cols = ["Acidobacteria", "Actinobacteria", "Armatimonadetes", "Bacteroidetes"]
df.set_index("Labels")[cols]\
    .stack()\
    .reset_index()\
    .rename(columns={'level_1': 'Bacteria', 0: 'Abundance'})
# If you want to stack all columns but "Labels", this is enough:
df.set_index("Labels")\
    .stack()\
    .reset_index()\
    .rename(columns={'level_1': 'Bacteria', 0: 'Abundance'})
The trick to avoid recreating the "Labels" column is to set it as the index before stacking.
Output (Armatimonadetes is absent because it is all NaN in the head and stack() dropped those values):
    Labels        Bacteria  Abundance
 0       0   Acidobacteria        0.0
 1       0  Actinobacteria      495.0
 2       0   Bacteroidetes    27859.0
 3       1   Acidobacteria        0.0
 4       1  Actinobacteria     1256.0
 5       1   Bacteroidetes    46582.0
 6       0   Acidobacteria        0.0
 7       0  Actinobacteria     1081.0
 8       0   Bacteroidetes    23798.0
 9       1   Acidobacteria        0.0
10       1  Actinobacteria     2523.0
11       1   Bacteroidetes    35088.0
12       0   Acidobacteria        0.0
13       0  Actinobacteria     1383.0
14       0   Bacteroidetes    19338.0
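For comparison, the same wide-to-long reshape can probably also be written with DataFrame.melt. This is only a sketch under assumptions: the toy frame below copies the five rows shown in the question, the name long_df is made up for illustration, and the explicit dropna is there so that missing abundances are removed whichever pandas version is in use.

```python
import numpy as np
import pandas as pd

# Toy frame copied from the head shown in the question (illustration only).
df = pd.DataFrame({
    "Labels": [0, 1, 0, 1, 0],
    "Acidobacteria": [0, 0, 0, 0, 0],
    "Actinobacteria": [495, 1256, 1081, 2523, 1383],
    "Armatimonadetes": [np.nan] * 5,
    "Bacteroidetes": [27859, 46582, 23798, 35088, 19338],
})

# Keep "Labels" as an identifier column and unpivot every other column
# into (Bacteria, Abundance) pairs, then drop rows with missing abundance.
long_df = (
    df.melt(id_vars="Labels", var_name="Bacteria", value_name="Abundance")
      .dropna(subset=["Abundance"])
      .reset_index(drop=True)
)

print(long_df.head())
```

Note that melt orders the rows column by column (all Acidobacteria rows first), whereas stack goes row by row. That should not matter for plotting, since the result can be passed to sns.stripplot with x='Bacteria', y='Abundance' and hue='Labels' either way.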
Upvotes: 1
Reputation: 1843
So I figured it out. I had to rearrange my data by stacking and re-indexing:
cols = df.columns.values.tolist()[3:]
stacked = df[cols].stack().reset_index()
stacked.rename(columns={'level_0':'index','level_1':'Bacteria',0:'Abundance'},inplace=True)
Which outputs:
   index         Bacteria  Abundance
0      0    Acidobacteria   0.000000
1      0   Actinobacteria   0.005003
2      0  Armatimonadetes   0.000000
3      0    Bacteroidetes   0.281586
Next I had to create a new column to assign labels to each data point:
label_col = np.array([[label for _ in range(len(cols))] for label in df['Labels']])
label_col = label_col.flatten()
stacked['Labels'] = label_col
So now:
   index         Bacteria  Abundance  Labels
0      0    Acidobacteria   0.000000       0
1      0   Actinobacteria   0.005003       0
2      0  Armatimonadetes   0.000000       0
3      0    Bacteroidetes   0.281586       0
4      0       Chlamydiae   0.000000       0
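One caveat with building the label column by repetition: it assumes every row of df contributes exactly len(cols) entries to stacked. If any NaN values were dropped during stacking, the lengths would no longer line up. A possible alternative, sketched here on made-up toy data, is to look each label up through the 'index' column, which holds the original row label of df. The explicit dropna is an assumption added so the sketch behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one all-NaN column (illustration only).
df = pd.DataFrame({
    "Labels": [0, 1, 0],
    "Acidobacteria": [0.0, 0.0, 0.0],
    "Actinobacteria": [495.0, 1256.0, 1081.0],
    "Armatimonadetes": [np.nan, np.nan, np.nan],
})
cols = ["Acidobacteria", "Actinobacteria", "Armatimonadetes"]

# Stack as before; dropna() makes the removal of missing values explicit.
stacked = df[cols].stack().dropna().reset_index()
stacked = stacked.rename(
    columns={"level_0": "index", "level_1": "Bacteria", 0: "Abundance"}
)

# Map each original row label to its entry in df["Labels"], so the
# result stays correct even when rows contribute different numbers of values.
stacked["Labels"] = stacked["index"].map(df["Labels"])

print(stacked)
```

Because the lookup goes through the row index rather than position, it does not depend on how many columns survive for each row.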
And then plot:
def plot():
    ax = sns.stripplot(x='Bacteria', y='Abundance', data=stacked, hue='Labels', jitter=True)
    ax.set(xlabel='Bacteria', ylabel='Abundance')
    plt.setp(ax.get_xticklabels(), rotation=45)
    plt.show()

plot()
To produce this graph.
Thanks for the help!
Upvotes: 4