Thodoris P

Reputation: 563

Python pandas - value_counts not working properly

Based on this post on Stack Overflow, I tried the value_counts function like this:

df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))

It works, except that my data has 22 unique genres, yet after the split I get 42 columns, so some genres appear more than once. Data example:

     Action  Adventure   Casual  Design & Illustration   Early Access    Education   Free to Play    Indie   Massively Multiplayer   Photo Editing   RPG     Racing  Simulation  Software Training   Sports  Strategy    Utilities   Video Production    Web Publishing Accounting  Action  Adventure   Animation & Modeling    Audio Production    Casual  Design & Illustration   Early Access    Education   Free to Play    Indie   Massively Multiplayer   Photo Editing   RPG Racing  Simulation  Software Training   Sports  Strategy    Utilities   Video Production    Web Publishing  nan
0   nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan

(I have pasted the header and the first row only.)

I suspect the problem comes from my original data. My genres column was a list of lists, stored as strings that still contained the brackets,

for example [Action,Indie]. When Python read it, it treated [Action, Action and Action] as different values, and the output had 303 different values. So this is what I did:

new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        # drop the leading '[' and the trailing ']'
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')

Upvotes: 1

Views: 3140

Answers (1)

jezrael

Reputation: 862511

You have to remove the leading [ and the trailing ] from the genres column with str.strip, and then remove the spaces with str.replace. Otherwise the same genre is counted once with a leading space and once without.
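To see why both steps are needed, here is a minimal sketch on made-up data. The Series contents and the helper unique_labels are assumptions for illustration only, not part of the original file.

```python
import pandas as pd

# Hypothetical sample; the real CSV is not reproduced here.
genres = pd.Series(["[Action, Indie]", "[Indie, Action]"])

def unique_labels(series):
    """Return the sorted set of labels produced by splitting each row on commas."""
    return sorted({label for row in series.str.split(",") for label in row})

# No cleaning: brackets and leading spaces make every label distinct.
raw = unique_labels(genres)

# Brackets removed, but " Indie" and "Indie" still differ by a leading space.
stripped = unique_labels(genres.str.strip("[]"))

# Brackets and spaces removed: only the two real genres remain.
cleaned = unique_labels(
    genres.str.strip("[]").str.replace(" ", "", regex=False)
)

print(raw)
print(stripped)
print(cleaned)
```

Note that removing every space also collapses the spaces inside a genre name, which is why "Free to Play" shows up as "FreetoPlay" in the resulting columns.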

import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")


df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')

df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))

# temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print(df)
    # output removed for clarity
print(df.columns)
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
       u'initial_price', u'is_free', u'metacritic', u'release_date',
       u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
       u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
       u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
       u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
       u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
       u'WebPublishing'],
      dtype='object')
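If you would rather keep the spaces inside genre names, one possible alternative is sketched below. It is a different technique from the one above: it normalizes the spacing around the commas and then uses Series.str.get_dummies instead of value_counts. The DataFrame contents are made up for illustration, and the approach assumes each genre appears at most once per row.

```python
import pandas as pd

# Hypothetical stand-in for the real data, including a missing value.
df = pd.DataFrame({
    "appid": [10, 20, 30],
    "genres": ["[Action, Indie]", "[Free to Play,Action]", None],
})

# Remove the surrounding brackets, then drop whitespace around the commas only,
# so spaces inside a genre name such as "Free to Play" are preserved.
cleaned = (
    df["genres"]
    .str.strip("[]")
    .str.replace(r"\s*,\s*", ",", regex=True)
)

# One 0/1 indicator column per genre; rows with a missing value get all zeros.
indicators = cleaned.str.get_dummies(sep=",")

result = df.join(indicators)
print(result)
```

Here result keeps the original columns and adds one column per distinct genre, with no duplicates caused by stray spaces or brackets.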

Upvotes: 1
