Reputation: 183

(Python) How can I compare 2 or more columns with Pandas?

I have been using the module pandas for data scraping, and albeit I understood how to (), I'm still unsure about how can I compare 2 or more columns of a CSV. Taking as an example the code below, I wanted to find out, e.g. the 3 publishers who published more Action, Shooter and Platform games, separately. I wrote the code below, but the output shows "False" instead of the name of the Genre. At least I believe the top 3 publishers are correct, but I'm not sure. Could anyone have a look?

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3)
print(a)

s = data['Publisher'].groupby((data['Genre'] == 'Shooter')).value_counts().head(3)
print(s)

p = data['Publisher'].groupby((data['Genre'] == 'Platform')).value_counts().head(3)
print(p)

Also, I should find out the top 3 publishers who sold the most Action, Shooter and Platform Games altogether. I tried writing this, but didn't work. How can I use 3 items of the same column, at the same time, and compare them with another 2 columns? And what if I want to include a time frame, e.g., compare all these columns for the past 10 years?

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

a = ((data['Genre'] == 'Action') & (data['Genre'] == 'Shooter') & (data['Genre'] == 'Platform')).groupby((data['Publisher']) & (data['Global_Sales'])).value_counts().head(3)
print(a)

Upvotes: 0

Answers (3)

Murilo Malek

Reputation: 85

For the first doubt, you can code:

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

# Group data in Genres
grouped_data = data['Publisher'].groupby((data['Genre'])).value_counts()

# By know you already have the values you want inside the "grouped_data"
# But, you can create smaller tables to see it better

a = grouped_data['Action']
s = grouped_data['Shooter']
p = grouped_data['Platform']

I did not get exactly what you need in the second question. But you can do compare Publisher and Genre using:

import pandas as pd

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

# group data by Publisher and Genre using .sum() for getting total sales
grouped_2 = data.groupby((data['Publisher'], data['Genre'])).sum()

# Look for a specific Publisher x Genre
specific = grouped_2.loc['Nintendo', 'Sports']
print(specific)

# For making the analysis for last 10 years
Recent_data = data[data['Year_of_Release']>2010]

# Now, you can replace "data" for "Recent_data" and make the same analysis for the last 10 years.

Upvotes: 0

NYC Coder

Reputation: 7604

For the first 3, you can do this:

data = pd.read_csv("https://sites.google.com/site/dr2fundamentospython/arquivos/Video_Games_Sales_as_at_22_Dec_2016.csv")

a = data[data['Genre']=='Action'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(a.head(3))

s = data[data['Genre']=='Shooter'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(s.head(3))

s = data[data['Genre']=='Platform'].groupby(by=['Publisher', 'Genre'], as_index=False).size().reset_index(name='count').sort_values('count', ascending=False)
print(s.head(3))

Output:

              Publisher   Genre  count
10           Activision  Action    311
148  Namco Bandai Games  Action    251
214             Ubisoft  Action    198

           Publisher    Genre  count
5         Activision  Shooter    162
39   Electronic Arts  Shooter    145
135          Ubisoft  Shooter     92

   Publisher     Genre  count
60  Nintendo  Platform    112
81       THQ  Platform     85
86   Ubisoft  Platform     70

And for the last one you can do this:

all = data[(data['Genre']=='Platform') | (data['Genre']=='Shooter') | (data['Genre']=='Action')].groupby(by=['Publisher'], as_index=False).agg({'Global_Sales': 'sum'}).reset_index(drop=['index']).sort_values('Global_Sales', ascending=False)
print(all.head(3))

Output:

           Publisher  Global_Sales
195         Nintendo        623.24
11        Activision        480.94
84   Electronic Arts        287.13

Upvotes: 1

maow

Reputation: 2887

These are a lot of questions at once:

a = data['Publisher'].groupby((data['Genre'] == 'Action')).value_counts().head(3) print(a)

In a groupby you do not specify a concrete Genre like 'Action'. That is what query is for. The point of groupby is to perform the following calculation for every Genre

In [11]: number_of_games = data.groupby('Genre')['Publisher'].value_counts()                                                                                                                                                        
Out[11]: 
Genre     Publisher              
Action    Activision                 311
          Namco Bandai Games         251
          Ubisoft                    198
          THQ                        194
          Electronic Arts            183
                                    ... 
Strategy  Time Warner Interactive      1
          Titus                        1
          Trion Worlds                 1
          Westwood Studios             1
          Zoo Digital Publishing       1
Name: Publisher, dtype: int64

Note that the selection of Publisher is after the grouping, so internally pandas loops over all values in Genre and does a value_count of the Publisher

I should find out the top 3 publishers who sold the most Action, Shooter and Platform Games

Simply filter for the categories you want like this

In [25]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :]                                                                                                                                                 
Out[25]: 
Genre    Publisher         
Action   Activision            311
         Namco Bandai Games    251
         Ubisoft               198
         THQ                   194
         Electronic Arts       183
                              ... 
Shooter  Visco                   1
         Warashi                 1
         Wargaming.net           1
         Xseed Games             1
         id Software             1
Name: Publisher, dtype: int64

Then again you want the largest 3 Publishers per Genre and therefore you use another groupby

In [30]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).head(3)                                                                                                                      
Out[30]: 
Genre     Publisher         
Action    Activision            311
          Namco Bandai Games    251
          Ubisoft               198
Platform  Nintendo              112
          THQ                    85
          Ubisoft                70
Shooter   Activision            162
          Electronic Arts       145
          Ubisoft                92
Name: Publisher, dtype: int64

The function head implicitly relies on the values being sorted. Alternatively you could use nlargest

In [31]: number_of_games.loc[['Action', 'Shooter', 'Platform'], :].groupby(['Genre']).nlargest(3).droplevel(0)                                                                                                     
Out[31]: 
Genre     Publisher         
Action    Activision            311
          Namco Bandai Games    251
          Ubisoft               198
Platform  Nintendo              112
          THQ                    85
          Ubisoft                70
Shooter   Activision            162
          Electronic Arts       145
          Ubisoft                92
Name: Publisher, dtype: int64

The result is the same but you would need to clean up the index with droplevel as this was appearing twice

And what if I want to include a time frame, e.g., compare all these columns for the past 10 years?

You obviously would need data for the timeframe. If you just want Games published in the last 10 years, filter the original data for games newer than 10 years. If you want to resolve which Publishers published most every year, you would create a column with the year of publication and groupby this as well. With Genre and Publisher you have already seen, that you can group by a list of features.

Upvotes: 1

(Python) How can I compare 2 or more columns with Pandas?

Answers (3)

Related Questions