Reputation: 23
I need to create a somewhat unusual bar plot in matplotlib and the standard functionality does not seem to offer what I need.
I have clustered some documents and want to show the 5 most important keywords per cluster. The first problem is that each cluster forms its own group of 5 individual bars. The second problem is that the labels of these individual bars matter, differ between groups, and are not unique either.
I have a makeshift prototype that looks like this:
I just plotted all the individual bars in the right order and separated them by empty entries. The biggest problem (aside from being ugly) is that the only way to identify the cluster is by counting the groups. It would help a lot if the clusters could be identified either by color or something else, but I cannot figure out how to do this.
Edit: Here is some requested toy data as well as the code used to produce the plot I already have.
Toy data:
The following two pandas dataframes are stored in a list. The two blocks below show the output of df_list[i].to_csv(). I hope this helps, but for this problem the actual data does not really matter, so you can also just create your own dataframes.
,features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127
and
,features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198
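If parsing the CSV output is inconvenient, the same toy data can be built directly. This is only a minimal sketch: the values are copied from the two blocks above, and the name df_list is the one used in the to_csv() call.

```python
import pandas as pd

# Toy data copied from the CSV output above; one dataframe per cluster.
df_list = [
    pd.DataFrame({
        "features": ["knowledg", "manag", "innov", "organ", "km"],
        "score": [0.09862235117497174, 0.07812351138840486,
                  0.06502084705448799, 0.0561819290497529,
                  0.05580332888282127],
    }),
    pd.DataFrame({
        "features": ["knowledg", "develop", "manag", "system", "sustain"],
        "score": [0.04217018718591911, 0.03423580137595049,
                  0.032239226503136, 0.031064303713788467,
                  0.029628875636649198],
    }),
]
```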
Code:
The approach for the current solution is to combine all the individual dataframes into one dataframe, add empty entries where necessary, and plot the result.
import matplotlib.pyplot as plt
import pandas as pd

def plot_all_clusters_words(dfs):
    # target structure: word as non unique column, value as other non unique column
    df_dict_list = []
    for df in dfs:
        for index, row in df.iterrows():
            df_dict_list.append({"word": row.features, "value": row.score})
        # empty entry acts as a spacer between clusters
        df_dict_list.append({"word": "", "value": 0})
    # drop the trailing spacer after the last cluster
    df_dict_list = df_dict_list[:-1]
    new_df = pd.DataFrame(df_dict_list)
    new_df.plot.bar(x="word")
    plt.show()
    return new_df
Note:
I just need a way to easily identify the groups. If you know a different approach than the ones I suggested above, feel free to propose it.
Upvotes: 0
Views: 193
Reputation: 80409
Calling plt.bar once for each of the dataframes, each with its own label and color, makes the clusters identifiable through the legend. The following code creates such a plot:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from io import StringIO
df1_str = '''features,score
0,knowledg,0.09862235117497174
1,manag,0.07812351138840486
2,innov,0.06502084705448799
3,organ,0.0561819290497529
4,km,0.05580332888282127'''
df2_str = '''features,score
0,knowledg,0.04217018718591911
1,develop,0.03423580137595049
2,manag,0.032239226503136
3,system,0.031064303713788467
4,sustain,0.029628875636649198'''
df1 = pd.read_csv(StringIO(df1_str))
df2 = pd.read_csv(StringIO(df2_str))
dfs = [df1, df2]
cluster_names = [f'cluster {i}' for i in range(1, len(dfs) + 1)]
colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))
bar_width = 0.8 # width of individual bars
cluster_gap = 0.2 # extra distance between clusters
starts = np.append(0, np.array([len(df) + cluster_gap for df in dfs]).cumsum())
all_tickpos = [s + np.arange(len(df)) for df, s in zip(dfs, starts)]
for df, name, color, tickpos in zip(dfs, cluster_names, colors, all_tickpos):
    plt.bar(tickpos, df['score'], width=bar_width, color=color, label=name)
plt.xticks(np.concatenate(all_tickpos), [f for df in dfs for f in df['features']], rotation=90)
plt.legend()
plt.tight_layout()
plt.show()
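The same idea can be wrapped in a reusable function that also writes each cluster name directly above its group, so the groups stay identifiable without a legend. This is only a sketch under some assumptions: the function name plot_clusters and its parameters are made up for illustration, every dataframe is assumed to be non-empty with 'features' and 'score' columns, and the scores in the usage part are rounded from the toy data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_clusters(dfs, names=None, bar_width=0.8, cluster_gap=1.0, ax=None):
    """Draw one colored group of bars per dataframe (hypothetical helper)."""
    if ax is None:
        _, ax = plt.subplots()
    if names is None:
        names = [f"cluster {i}" for i in range(1, len(dfs) + 1)]
    colors = plt.cm.rainbow(np.linspace(0, 1, len(dfs)))

    positions, labels = [], []
    start = 0.0
    for df, name, color in zip(dfs, names, colors):
        pos = start + np.arange(len(df))
        ax.bar(pos, df["score"], width=bar_width, color=color, label=name)
        # Put the cluster name above the tallest bar, centered on the group.
        ax.text(pos.mean(), df["score"].max(), name, ha="center", va="bottom")
        positions.extend(pos)
        labels.extend(df["features"])
        start += len(df) + cluster_gap  # leave a gap before the next group

    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=90)
    ax.margins(y=0.1)  # headroom so the cluster names are not clipped
    return ax


# Usage with the toy data (scores rounded for brevity).
df1 = pd.DataFrame({"features": ["knowledg", "manag", "innov", "organ", "km"],
                    "score": [0.0986, 0.0781, 0.0650, 0.0562, 0.0558]})
df2 = pd.DataFrame({"features": ["knowledg", "develop", "manag", "system", "sustain"],
                    "score": [0.0422, 0.0342, 0.0322, 0.0311, 0.0296]})
ax = plot_clusters([df1, df2])
ax.figure.tight_layout()
```

Call plt.show() afterwards to display the figure. Passing an existing ax lets the function draw into a subplot of a larger figure, and ax.legend() can still be added because each group keeps its label.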
Upvotes: 1