Reputation: 55
I am trying to plot a graph in Python that shows the occurrence of items over time. I want to find out how many items that match two categories appear in each year and plot a graph based on those counts.
This is my data in Excel:
What I want to end up with is a list of movies which are both fantasy and action, and how many times they appear in each year. This is the end result I have gotten to (which is correct):
i.e. there are 2 movies in 2004 which are both fantasy and action, there is one movie in 2005 which is both fantasy and action, etc.
Here are the steps which I have taken to get to the result:
#import data:
import pandas as pd
data = pd.read_csv("data.csv")
#select all fantasy movies (rename returns a new frame, which avoids SettingWithCopyWarning):
fantasy_movies = data.loc[data['Genre'] == 'Fantasy', ['Name', 'Genre']].rename(columns={'Genre': 'Fantasy'})
#select all action movies:
action_movies = data.loc[data['Genre'] == 'Action', ['Name', 'Genre']].rename(columns={'Genre': 'Action'})
#merge the two datasets:
action_fantasy = pd.merge(fantasy_movies, action_movies)
#obtain a list of unique movie names:
unique = action_fantasy.Name.unique()
#make dates the column and unique names the rows
filter_data = data[(data.Name.isin(unique))]
table = filter_data.pivot_table(filter_data, index = ['Name'],columns=['year'])
#replace all NaNs with zero
table1 = table.fillna(0)
#Count items in years
table1.gt(0).astype(int).sum(axis=0)
Now, from here I would like to draw a graph (I'm thinking of a bar graph) using Matplotlib, with the years along the bottom and the bar heights given by the counts in the table1 result. I am struggling to create one, even though it should technically be as easy as putting data on the x-axis and data on the y-axis.
Like the code from W3 Schools: https://www.w3schools.com/python/matplotlib_bars.asp
import matplotlib.pyplot as plt
import numpy as np
x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])
plt.bar(x, y)
plt.show()
I wonder if my data is in the wrong format? What would be my x-axis and y-axis?
Upvotes: 1
Views: 6457
Reputation: 41327
I wonder if my data is in the wrong format?
Not "wrong" per se, but it has a MultiIndex, which is a bit of a hassle and unnecessary here. I suggest getting rid of the MultiIndex with Series.droplevel before plotting via pandas/matplotlib/seaborn.
After dropping the MultiIndex, use Series.plot.bar, which plots the values as y against the index as x:
counts = table1.gt(0).astype(int).sum(axis=0).droplevel(0)
# year
# 2004 2
# 2005 1
# 2011 1
# 2016 1
# 2018 2
# dtype: int64
counts.plot.bar(ylabel='total')
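To see what droplevel does here without the original data, the following is a minimal, self-contained sketch. The Series is a hypothetical stand-in for the summed pivot table: the counts are copied from the output above, and the outer level name "Rating" is invented purely for illustration (the real outer level depends on the columns in data.csv).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for table1.gt(0).astype(int).sum(axis=0).
# "Rating" is an invented outer level; the inner level is the year.
index = pd.MultiIndex.from_product(
    [["Rating"], [2004, 2005, 2011, 2016, 2018]],
    names=[None, "year"],
)
summed = pd.Series([2, 1, 1, 1, 2], index=index)

# Drop the outer level so only "year" remains as the index.
counts = summed.droplevel(0)

# pandas plots the values as bar heights and the index as x tick labels.
ax = counts.plot.bar(ylabel="total")
```

In a script, call plt.show() afterwards to display the figure. Because the index is now a plain year index, pandas labels the x-axis with the years directly.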
If you really want to use plt.bar, I suggest resetting the Series into a DataFrame and then plotting the total against the range index:
counts = table1.gt(0).astype(int).sum(axis=0).droplevel(0).reset_index(name='total')
# year total
# 0 2004 2
# 1 2005 1
# 2 2011 1
# 3 2016 1
# 4 2018 2
import matplotlib.pyplot as plt
plt.bar(counts.index, counts.total)
plt.xticks(ticks=counts.index, labels=counts.year)
plt.xlabel('year')
plt.ylabel('total')
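The same plt.bar approach can be sketched as a standalone snippet using Matplotlib's Axes interface. The counts below are hypothetical, copied from the output shown above rather than computed from the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-year counts, copied from the output shown above.
counts = pd.Series(
    [2, 1, 1, 1, 2],
    index=pd.Index([2004, 2005, 2011, 2016, 2018], name="year"),
).reset_index(name="total")

fig, ax = plt.subplots()
# Bars sit at the evenly spaced positions 0..4 of the range index.
bars = ax.bar(counts.index, counts["total"])
# Relabel those positions with the actual years.
ax.set_xticks(counts.index)
ax.set_xticklabels(counts["year"])
ax.set_xlabel("year")
ax.set_ylabel("total")
```

Plotting against the range index keeps the bars evenly spaced. Passing the numeric years as x positions instead would leave empty gaps for years with no matching movies, such as 2006 to 2010.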
Alternatively, pass the DataFrame into sns.barplot:
import seaborn as sns
sns.barplot(data=counts, x='year', y='total')
Upvotes: 1