Reputation: 35

How do I get a word frequency count that is grouped by a second variable (Python)

I am new to Python, so its probable that I am just not wording this properly to find the answer.

Using Pandas I was able to find the most frequent N words for every record in the description field of my data. However, I have two columns; a categorical column and the description field. How to I find the most common word per category?

Ex Data:

 - Property|Description
 - House| Blue, Two stories, pool
 - Car| Green, Dented, Manual, New
 - Car| Blue, Automatic, Heated Seat
 - House|New, Furnished, HOA
 - Car|Blue, Old, Multiple Owners

My current code will return Blue=3, New=2 etc. But what I need to know is that Blue appeared for the word Car twice and once for House.

Current relevant code

words = (data.Description.str.lower().str.cat(sep=' ').split())
keywords=pandas.DataFrame(Counter(words).most_common(10), columns=['Words', 'Frequency'])

Upvotes: 3

Answers (3)

wwnde

Reputation: 26676

Data

df=pd.DataFrame({'Property':['House','Car','Car','House','Car'],'Description':['Blue,Two stories,pool','Green,Dented,Manual,New','Blue,Automatic,Heated Seat','Blue,Furnished,HOA','Blue,Old,Multiple Owners']})

Chained solution df.assign(words=df.Description.str.lower().str.split(',')).explode('words').groupby('Property')['words'].value_counts()

Explanation with a breakdown

#Create list

df['words'] = df.Description.str.lower().str.split(',')

 #Explode and count

df=df.explode('words').groupby('Property')['words'].value_counts()

Property  words          
Car       blue               2
          automatic          1
          dented             1
          green              1
          heated seat        1
          manual             1
          multiple owners    1
          new                1
          old                1
House     blue               2
          furnished          1
          hoa                1
          pool               1
          two stories        1
Name: words, dtype: int64

Upvotes: 1

sushanth

Reputation: 8302

Try this, split the row values by delimeter then apply explode to transform each element of a list-like to a row, finally Groupby

# remove leading white space's & split by delimiter
df['Description'] = df['Description'].str.strip()\
    .str.replace(",\s+", ",")\
    .str.split(',')

# apply group by to get count of each word.
print(df.explode(column='Description').
      groupby(["Property","Description"]).size().reset_index(name='count'))

ouptut,

   Property      Description  count
0       Car        Automatic      1
1       Car             Blue      2
2       Car           Dented      1
3       Car            Green      1
4       Car      Heated Seat      1
...

Upvotes: 1

user3041840

Reputation: 422

use groupby before counting: doc. and then you can do counting on each group

df = pd.DataFrame(...)

groups = df.groupby(['column_name'])
for group in groups:
    do_counting()

Upvotes: 0

How do I get a word frequency count that is grouped by a second variable (Python)

Answers (3)

Related Questions