Reputation: 1
I have a sorted MultiIndex pandas DataFrame that I need to plot as a bar chart. My data frame.
Either I haven't found the solution yet or a simple one doesn't exist, but I need to plot a bar chart of this data with Content and Category on the x-axis and Installs as the bar height.
In simple terms, I need to show what each bar consists of, e.g. 20% of it would come from Everyone, 40% from Teen, etc. I'm not sure that is even possible: a mean of means would be wrong because the sample sizes differ, which is why I added an Uploads column to calculate it properly, but I haven't gotten as far as plotting by mean.
I think plotting cumulative values would give a wrong result, though.
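The mean-of-means concern can be checked on a few made-up numbers. The sketch below is only an illustration: the frame `cat` and all of its values are hypothetical, and it assumes the Uploads column counts the rows behind each Installs total. It compares the plain mean of group means with the pooled mean (total installs over total uploads) and also computes each Content value's share of its Category.

```python
import pandas as pd

# Hypothetical per-(Category, Content) totals; every number here is made up.
cat = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Social", "Social"],
        "Content": ["Everyone", "Teen", "Everyone", "Teen"],
        "Installs": [9000, 1000, 400, 600],
        "Uploads": [30, 10, 2, 8],
    }
).set_index(["Category", "Content"])

# Mean installs per upload inside each (Category, Content) group.
cat["Mean"] = cat["Installs"] / cat["Uploads"]

# Mean of means: treats every group equally, whatever its sample size.
mean_of_means = cat.groupby(level="Category")["Mean"].mean()

# Pooled mean: total installs divided by total uploads per Category.
totals = cat.groupby(level="Category")[["Installs", "Uploads"]].sum()
pooled_mean = totals["Installs"] / totals["Uploads"]

# Share (in percent) of each Category's installs per Content value.
category_total = cat.groupby(level="Category")["Installs"].transform("sum")
share = cat["Installs"] / category_total * 100
```

With these numbers the two averages disagree for Arcade (200 versus 250), which is the effect described above. The `share` values add up to 100 within each Category, so they can be stacked in one bar.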
I need to plot a bar chart whose x-ticks are the Category values (preferably just the first 10). Each x-tick should have one bar per Content value (not always 3; it could be just "Everyone" and "Teen"), and the height of each bar should be Installs.
Ideally, it should look like this: Bar Chart, but with each bar split into separate bars for the Content values of that specific Category.
I have tried flattening it with DataFrame.unstack(), but that ruins the sorting of the data frame, so I used Cat2 = Cat1.reset_index(level = [0,1]) instead. I still need help with the plotting.
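One possible way to keep the sort order through the reshape is to remember the Category order first and restore it with reindex() after unstack(). This is a minimal sketch, not a tested solution for the real data: `sorted_df` is a hypothetical stand-in for the already sorted frame and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the sorted MultiIndex frame; values are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("Social", "Everyone"),
        ("Social", "Teen"),
        ("Arcade", "Everyone"),
        ("Arcade", "Mature"),
        ("Arcade", "Teen"),
        ("Business", "Everyone"),
    ],
    names=["Category", "Content"],
)
sorted_df = pd.DataFrame({"Installs": [900, 300, 500, 200, 100, 50]}, index=index)

# Remember the Category order before reshaping.
order = sorted_df.index.get_level_values("Category").unique()

# unstack() moves Content into the columns; by default the rows come back
# in sorted label order, not in the original order.
wide = sorted_df["Installs"].unstack("Content")

# Restore the remembered order and keep at most the first 10 categories.
wide = wide.reindex(order).head(10)
```

Calling `wide.plot.bar()` on the result should then draw one group of bars per Category in the restored order. Combinations that do not exist (such as Business and Teen here) are NaN, so no visible bar is expected for them.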
So far I have:
Cat = Popular.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum"})
Uploads = Popular[["Category","Content"]].value_counts().rename_axis(["Category","Content"]).reset_index(name = "Uploads")
Cat = pd.merge(Cat, Uploads, on = ["Category","Content"])
Cat = Cat.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum", "Uploads": "sum"})
which gives this
Then I sort it like so
Cat1 = Cat.unstack()
Cat1 = Cat1.sort_index(key = (Cat1["Installs"].sum(axis = 1)/Cat1["Uploads"].sum(axis = 1)).get, ascending = False).stack()
Thanks to one of those solutions
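For comparison, the same ordering (categories ranked by total installs per upload) can be sketched without the unstack/stack round trip. This is an alternative under stated assumptions, not the linked solution: `cat` is a hypothetical stand-in for the grouped frame, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the grouped frame; every number is made up.
cat = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Social", "Social"],
        "Content": ["Everyone", "Teen", "Everyone", "Teen"],
        "Installs": [9000, 1000, 4000, 6000],
        "Uploads": [30, 10, 5, 15],
    }
).set_index(["Category", "Content"])

# Per-Category totals, repeated on every row of that Category.
per_cat = cat.groupby(level="Category")[["Installs", "Uploads"]].transform("sum")
ratio = per_cat["Installs"] / per_cat["Uploads"]

# Stable descending sort: rows of one Category share the same ratio,
# so they stay together and keep their original Content order.
order = np.argsort(-ratio.to_numpy(), kind="stable")
cat_sorted = cat.iloc[order]
```

Because the sort is positional and stable, the Content rows inside each Category are not reshuffled.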
That's all I have.
The data set is from Kaggle and is over 600 MB. I don't expect anyone to download it, but I would appreciate at least a simple pointer towards a solution.
P.S. This should also help me split each dot in the scatter plot below in the same way, but if not, that's fine.
P.P.S. I don't have enough reputation to post pictures, so apologies for the links.
Upvotes: 0
Views: 2388
Reputation: 1
ChatGPT has answered my question
import pandas as pd
import matplotlib.pyplot as plt
# create a dictionary of data for the DataFrame
data = {
    'app_name': ['Google Maps', 'Uber', 'Waze', 'Spotify', 'Pandora'],
    'category': ['Navigation', 'Transportation', 'Navigation', 'Music', 'Music'],
    'rating': [4.5, 4.0, 4.5, 4.5, 4.0],
    'reviews': [1000000, 50000, 100000, 500000, 250000]
}
# create the DataFrame
df = pd.DataFrame(data)
# set the 'app_name' and 'category' columns as the index
df = df.set_index(['app_name', 'category'])
# add a new column called "content_rating" to the DataFrame, and assign a content rating to each app
df['content_rating'] = ['Everyone', 'Teen', 'Everyone', 'Everyone', 'Teen']
# Grouping the Data by category and content_rating and getting the mean of reviews
df_grouped = df.groupby(['category','content_rating']).agg({'reviews':'mean'})
# Reset the index to make it easier to plot
df_grouped = df_grouped.reset_index()
# Plotting the stacked bar chart
df_grouped.pivot(index='category', columns='content_rating', values='reviews').plot(kind='bar', stacked=True)
This is a sample data set
What I did was add a sum column to the data set and sort by that sum.
piv = qw1.reset_index()
piv = piv.pivot_table(index='Category', columns='Content', values='per')#.plot(kind='bar', stacked = True)
piv["Sum"] = piv.sum(axis=1)
piv_10 = piv.sort_values(by = "Sum", ascending = False)[["Adult", "Everyone", "Mature", "Teen"]].head(10)
where qw1 is the multi-index data frame.
Then all I had to do was plot it:
piv_10.plot.bar(stacked = True, logy = False)
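A self-contained variant of the steps above, for anyone who wants to try them without the real data. It is only a sketch: `qw1` here is a hypothetical stand-in with made-up "per" values, and instead of hard-coding the four Content names it reads them from the pivoted columns, which is an assumption about what is convenient rather than part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for qw1: a (Category, Content) MultiIndex frame
# with a made-up "per" column.
qw1 = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Social", "Social", "Tools"],
        "Content": ["Everyone", "Teen", "Everyone", "Teen", "Everyone"],
        "per": [30.0, 20.0, 50.0, 25.0, 10.0],
    }
).set_index(["Category", "Content"])

# One row per Category, one column per Content value.
piv = qw1.reset_index().pivot_table(index="Category", columns="Content", values="per")

# Remember the Content columns before adding the helper column.
content_cols = list(piv.columns)

# Row totals (NaN is skipped), used only for ranking the categories.
piv["Sum"] = piv.sum(axis=1)

# Top 10 categories by total, with the helper column dropped again.
top = piv.sort_values(by="Sum", ascending=False)[content_cols].head(10)
```

`top.plot.bar(stacked=True)` then draws the chart as in the line above. Taking the column names from the pivot avoids a KeyError when one of the Content values is missing from the data.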
Upvotes: 0
Reputation: 3552
Edit: added the code to compute "Installs" percentage per "Category".
The dataset is big, but you should have provided mock data to easily reproduce the example, as follows:
import pandas as pd
import numpy as np
categories = ["Productivity", "Arcade", "Business", "Social"]
contents = ["Everyone", "Matute", "Teen"]
index = pd.MultiIndex.from_product(
[categories, contents], names=["Category", "Content"]
)
installs = np.random.randint(low=100, high=999, size=len(index))
df = pd.DataFrame({"Installs": installs}, index=index)
>>> df
                       Installs
Category     Content
Productivity Everyone       149
             Matute         564
             Teen           301
Arcade       Everyone       926
             Matute         542
             Teen           556
Business     Everyone       879
             Matute         921
             Teen           323
Social       Everyone       329
             Matute         320
             Teen           426
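If you want to compute the "Installs" percentage per "Category", use groupby().apply(). The numbers differ between the outputs shown here because the mock data is random and no seed is set: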
>>> df["Installs (%)"] = (
... df["Installs"]
... .groupby(by="Category", group_keys=False)
... .apply(lambda df: df / df.sum() * 100)
... )
>>> df
                       Installs  Installs (%)
Category     Content
Productivity Everyone       513     22.246314
             Matute         839     36.383348
             Teen           954     41.370338
Arcade       Everyone       122     10.581093
             Matute         519     45.013010
             Teen           512     44.405898
Business     Everyone       412     31.164902
             Matute         698     52.798790
             Teen           212     16.036309
Social       Everyone       874     52.555622
             Matute         326     19.603127
             Teen           463     27.841251
Then you can just .unstack() once:
>>> df = df.unstack()
>>> df
              Installs                Installs (%)
Content       Everyone  Matute  Teen   Everyone     Matute       Teen
Category
Arcade             499     904   645  24.365234  44.140625  31.494141
Business           856     819   438  40.511122  38.760057  20.728822
Productivity       705     815   657  32.384015  37.436840  30.179146
Social             416     482   238  36.619718  42.429577  20.950704
And then bar plot the feature you want:
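import matplotlib.pyplot as plt  # needed for plt.subplots and plt.show

fig, (ax, ax_percent) = plt.subplots(ncols=2, figsize=(14, 5))
# One group of bars per Category, one bar per Content value.
# rot=0 keeps the tick labels horizontal (rot expects an angle, not a bool).
df["Installs"].plot(kind="bar", rot=0, ax=ax)
ax.set_ylabel("Installs")
df["Installs (%)"].plot(kind="bar", rot=0, ax=ax_percent)
ax_percent.set_ylabel("Installs (%)")
ax_percent.set_ylim([0, 100])
plt.show()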
Upvotes: 0