frito
frito

Reputation: 33

Python Pandas SUMIF excel equivalent

I can't figure out how to achieve a certain task in my python script.

I have a dataframe that contains media coverage for a specific topic. One of my columns names the author of the specific article for example...

enter image description here

I am trying to create a pivot table to show the total count for each journalist in the column like so

datajournalist = company1_topline.pivot_table(index='AuthorUsername', values='ContentID', aggfunc= np.count_nonzero )

which would give me something like

AuthorUsername                                                 count_nonzero
Aaron Mehta                                                      1              
Aamer Madhani                                                    1               
Aamer Madhani ; Ben Fox                                          1      

What I am looking for is a way to have the pivot table also count the names that appear in multiple value cells, to get a true count of each author. So for example the row that has "Aamer Madhani ; Ben Fox" would also count towards the count for just "Aamer Madhani" so the row for just "Aamer Madhani" would have a count of 2 instead of 1 etc.... Is there a way to do this? In excel this can be achieved with a SUMIF, but I do not know how to do this with Python/Pandas.

Desired output

AuthorUsername                                                 count_nonzero
Aaron Mehta                                                      1              
Aamer Madhani                                                    2               
Aamer Madhani ; Ben Fox                                          1     

If anyone can point me in the right direction it would be much appreciated.

Upvotes: 1

Views: 315

Answers (1)

user2246849
user2246849

Reputation: 4407

With your DataFrame having the AuthorUsername column like this:

            AuthorUsername
0              Aaron Mehta
1            Aamer Madhani
2  Aamer Madhani ; Ben Fox

You can do:

import collections

# Remove leading and trailing spaces (if any).
df['AuthorUsername'] = df['AuthorUsername'].str.strip()

# Get unique authors and their counts.
authors_counts = collections.Counter(df['AuthorUsername'].str.split('\s*;\s*').sum())

# Add to new column.
real_counts = collections.defaultdict(lambda: 1, authors_counts)
df['count_nonzero'] = [real_counts[a] for a in df['AuthorUsername']]

print(df)

Result:

            AuthorUsername  count_nonzero
0              Aaron Mehta              1
1            Aamer Madhani              2
2  Aamer Madhani ; Ben Fox              1

Edit after comments - sum more metrics:

After the comments, here is a more versatile version to also sum the Metrics column, and potentially others.

Input dataframe:

            AuthorUsername  Metrics
0              Aaron Mehta      1.3
1            Aamer Madhani      2.0
2  Aamer Madhani ; Ben Fox      0.5

Code:

df['AuthorUsername'] = df['AuthorUsername'].str.strip()
df['single_authors'] = df['AuthorUsername'].str.split('\s*;\s*')

df['count_nonzero'] = 1
single_metrics = df.explode('single_authors').groupby('single_authors').sum()
multiple_metrics = df[df['single_authors'].map(len) > 1].groupby('AuthorUsername').sum()

all_metrics = pd.concat([single_metrics, multiple_metrics]).rename_axis('AuthorUsername').reset_index()

df = df.drop(columns=['Metrics', 'count_nonzero', 'single_authors']).merge(all_metrics, how='left', on='AuthorUsername')

print(df)

Result:

            AuthorUsername  Metrics  count_nonzero
0              Aaron Mehta      1.3              1
1            Aamer Madhani      2.5              2
2  Aamer Madhani ; Ben Fox      0.5              1

Upvotes: 4

Related Questions