Reputation: 55
I have a Pandas dataframe with three columns: sentence, key phrases, category. The key phrases column contains either an empty list or words/phrases from the sentence row it comes from, like so:
Sentence | Key Phrases | Category |
---|---|---|
the red ball | ['red ball'] |
object |
a big blue box | ['blue'] |
object |
he throws the red ball | ['he throws','red ball'] |
action |
I want to check the contents of the entire key phrases column and build a frequency dictionary (or whatever is best) for every unique phrase. So in my example I'd have something like: 'red ball': 2, 'blue': 1, 'he throws': 1
Then I want to calculate the frequency distribution of these key phrases across all categories in the data frame. So in my example, object category is 100% of 'blue' ocurrences, but only 50% of 'red ball'. I am assuming the best way to do this is starting with the frequency dictionary I mentioned above?
Finally, I'd like to add another column to the dataframe which will show, for each key phrase in its row, what percentage of that key phrase's occurences exist within that category.
So the final DF would look something like this, though the aesthetic doesnt matter as long as the information is there:
Sentence | Key Phrases | Category | Key Phrase Ocurrences |
---|---|---|---|
the red ball | ['red ball'] |
object | red ball: 50% |
a big blue box | ['blue'] |
object | blue: 100% |
he throws the red ball | ['he throws', 'red ball'] |
action | he throws: 100%, red ball: 50% |
It would also be useful to have something like a dictionary where each key was the category and each value contained all the key phrases occurring within that category and their prevalence, so maybe this would be in the initial dictionary I'd create?
Upvotes: 2
Views: 856
Reputation: 8219
First we explode the df so we have line by line Key Phrases:
df2 = df.explode('Key Phrases')
df2
output:
Sentence Key Phrases Category
0 the red ball red ball object
1 a big blue box blue object
2 he throws the red ball he throws action
2 he throws the red ball red ball action
Then we create a table of occurrences as such:
df3 = df2.groupby(['Key Phrases','Category'])['Sentence'].count().unstack().fillna(0)
df3
output:
Category action object
Key Phrases
blue 0.0 1.0
he throws 1.0 0.0
red ball 1.0 1.0
Then we convert into frequencies by dividing occurrences by totals
df4 = df3.apply(lambda c: c/df3.sum(axis=1))
df4
output:
Category action object
Key Phrases
blue 0.0 1.0
he throws 1.0 0.0
red ball 0.5 0.5
This answers part 1 of your question
to fit it back to the original data, we can do the following. it is easier to do it to the already-exploded version, the dataframe df2
:
df2.merge(df4.unstack().rename('freq').reset_index(), how = 'left', on = ['Category', 'Key Phrases'])
output:
Sentence Key Phrases Category freq
-- ---------------------- ------------- ---------- ------
0 the red ball red ball object 0.5
1 a big blue box blue object 1
2 he throws the red ball he throws action 1
3 he throws the red ball red ball action 0.5
this has all the info you would like to see although admittedly not in quite a pretty format. as you said the format did not matter so much I will leave it here but you can always do further groupby
s etc
Upvotes: 0
Reputation: 621
You can try
df['Key Phrase Ocurrences'] = 100 * df.nunique(axis = 1)/df.count(axis = 1)
Upvotes: 1