Reputation: 183
I have been using the module pandas for data scraping, and albeit I understood how to (), I'm still unsure about how can I compare 2 or more columns of a CSV. Taking as an example the code below, I wanted to find out, e.g. the 3 publishers who published more Action, Shooter and Platform games, separately. I wrote the code below, but the output shows "False" instead of the name of the Genre. At least I believe the top 3 publishers are correct, but I'm not sure. Could anyone have a look?
import pandas as pd
data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")
a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3)
print(a)
s = data['Publisher'].groupby((data['Genre'] == 'Shooter')).value_counts().head(3)
print(s)
p = data['Publisher'].groupby((data['Genre'] == 'Platform')).value_counts().head(3)
print(p)
Also, I should find out the top 3 publishers who sold the most Action, Shooter and Platform Games altogether. I tried writing this, but didn't work. How can I use 3 items of the same column, at the same time, and compare them with another 2 columns? And what if I want to include a time frame, e.g., compare all these columns for the past 10 years?
import pandas as pd
data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")
a = ((data['Genre'] == 'Action') & (data['Genre'] == 'Shooter') & (data['Genre'] == 'Platform')).groupby((data['Publisher']) & (data['Global_Sales'])).value_counts().head(3)
print(a)
Upvotes: 0
Views: 125
Reputation: 85
For the first doubt, you can code:
import pandas as pd
data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")
# Group data in Genres
grouped_data = data['Publisher'].groupby((data['Genre'])).value_counts()
# By know you already have the values you want inside the "grouped_data"
# But, you can create smaller tables to see it better
a = grouped_data['Action']
s = grouped_data['Shooter']
p = grouped_data['Platform']
I did not get exactly what you need in the second question. But you can do compare Publisher and Genre using:
import pandas as pd
data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")
# group data by Publisher and Genre using .sum() for getting total sales
grouped_2 = data.groupby((data['Publisher'], data['Genre'])).sum()
# Look for a specific Publisher x Genre
specific = grouped_2.loc['Nintendo', 'Sports']
print(specific)
# For making the analysis for last 10 years
Recent_data = data[data['Year_of_Release']>2010]
# Now, you can replace "data" for "Recent_data" and make the same analysis for the last 10 years.
Upvotes: 0
Reputation: 7604
For the first 3, you can do this:
data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")
a = data[data['Genre']=='Action'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(a.head(3))
s = data[data['Genre']=='Shooter'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(s.head(3))
s = data[data['Genre']=='Platform'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(s.head(3))
Output:
Publisher Genre count
10 Activision Action 311
148 Namco Bandai Games Action 251
214 Ubisoft Action 198
Publisher Genre count
5 Activision Shooter 162
39 Electronic Arts Shooter 145
135 Ubisoft Shooter 92
Publisher Genre count
60 Nintendo Platform 112
81 THQ Platform 85
86 Ubisoft Platform 70
And for the last one you can do this:
all = data[(data['Genre']=='Platform') | (data['Genre']=='Shooter') | (data['Genre']=='Action')].groupby(by=['Publisher'], as_index=False).agg({'Global_Sales': 'sum'}).reset_index(drop=['index']).sort_values('Global_Sales', ascending=False)
print(all.head(3))
Output:
Publisher Global_Sales
195 Nintendo 623.24
11 Activision 480.94
84 Electronic Arts 287.13
Upvotes: 1
Reputation: 2887
These are a lot of questions at once:
a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3) print(a)
In a groupby you do not specify a concrete Genre like 'Action'. That is what query is for. The point of groupby is to perform the following calculation for every Genre
In [11]: number_of_games = data.groupby('Genre')['Publisher'].value_counts()
Out[11]:
Genre Publisher
Action Activision 311
Namco Bandai Games 251
Ubisoft 198
THQ 194
Electronic Arts 183
...
Strategy Time Warner Interactive 1
Titus 1
Trion Worlds 1
Westwood Studios 1
Zoo Digital Publishing 1
Name: Publisher, dtype: int64
Note that the selection of Publisher is after the grouping, so internally pandas loops over all values in Genre and does a value_count of the Publisher
I should find out the top 3 publishers who sold the most Action, Shooter and Platform Games
Simply filter for the categories you want like this
In [25]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :]
Out[25]:
Genre Publisher
Action Activision 311
Namco Bandai Games 251
Ubisoft 198
THQ 194
Electronic Arts 183
...
Shooter Visco 1
Warashi 1
Wargaming.net 1
Xseed Games 1
id Software 1
Name: Publisher, dtype: int64
Then again you want the largest 3 Publishers per Genre and therefore you use another groupby
In [30]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).head(3)
Out[30]:
Genre Publisher
Action Activision 311
Namco Bandai Games 251
Ubisoft 198
Platform Nintendo 112
THQ 85
Ubisoft 70
Shooter Activision 162
Electronic Arts 145
Ubisoft 92
Name: Publisher, dtype: int64
The function head
implicitly relies on the values being sorted. Alternatively you could use nlargest
In [31]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).nlargest(3).droplevel(0)
Out[31]:
Genre Publisher
Action Activision 311
Namco Bandai Games 251
Ubisoft 198
Platform Nintendo 112
THQ 85
Ubisoft 70
Shooter Activision 162
Electronic Arts 145
Ubisoft 92
Name: Publisher, dtype: int64
The result is the same but you would need to clean up the index with droplevel
as this was appearing twice
And what if I want to include a time frame, e.g., compare all these columns for the past 10 years?
You obviously would need data for the timeframe. If you just want Games published in the last 10 years, filter the original data for games newer than 10 years. If you want to resolve which Publishers published most every year, you would create a column with the year of publication and groupby this as well. With Genre and Publisher you have already seen, that you can group by a list of features.
Upvotes: 1