bluechips

Reputation: 43

Python: Count instances of a specific character in all rows within a dataframe column

I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']

I want to iterate through the rows of the dataframe to get the min, max, and average number of email addresses in the toaddress and ccaddress fields, determined by counting the instances of '@' within each field in those two columns.

If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the dataframe to get the average, but I think that only counts the rows that have at least one '@' sign.

Upvotes: 4

Views: 10341

Answers (4)

memebrain

Reputation: 403

This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data

import pandas as pd
from random import randint
from faker import Factory

fake = Factory.create()

def emails():
    # Return a list of 1 to 4 fake email addresses.
    return [fake.email() for _ in range(randint(1, 4))]

# Build all rows first; DataFrame.append was removed in pandas 2.0.
rows = [{'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()}
        for _ in range(10)]
df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

# Each cell holds a list, so its length is the number of addresses.
print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))

The last two lines are the part that counts your emails. I wasn't sure whether you wanted to check for '@' specifically; maybe you could use fake-factory to generate some test data as an example?

Upvotes: 0

ely

Reputation: 77404

You can use

df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))

to get back the count of '@' within each cell.

Then you can just compute the pandas max, min, and mean of the result along the row axis (axis=0, the default), which gives one value per column.

As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum() -- why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
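To make the min, max, and mean step concrete, here is a minimal sketch using the column-wise str.count variant. The sample rows and example.com addresses are made up for illustration, and treating missing cells as empty strings is an assumption about how you want them counted.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "toaddress": ["a@example.com, b@example.com", "c@example.com", None],
    "ccaddress": ["d@example.com", "", "e@example.com, f@example.com, g@example.com"],
    "body": ["hello", "hi", "hey"],
})

# Count '@' in every cell of the two address columns.
# Assumption: a missing cell counts as zero addresses, so fill it with ''.
counts = df[["toaddress", "ccaddress"]].fillna("").apply(lambda col: col.str.count("@"))

# One row per statistic, one column per address field.
summary = counts.agg(["min", "max", "mean"])
print(summary)
```

Here counts keeps the per-row numbers, so you can also inspect individual rows before aggregating.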

Upvotes: 3

Joseph Stover

Reputation: 427

Perhaps something like this

from pandas import DataFrame
import re

df = DataFrame({"emails": ["[email protected], [email protected]", 
                           "[email protected], none, [email protected], [email protected]"]})

at = re.compile(r"@")
def count_emails(string):
    # Count every '@' in the string, one per email address.
    return len(at.findall(string))

df["count"] = df["emails"].map(count_emails)

df

Returns:

    emails                                                  count
0   "[email protected], [email protected]"                     2
1   "[email protected], none, [email protected], Th..."     3

Upvotes: 0

Dmitry Rubanovich

Reputation: 2627

# number of rows whose toaddress contains at least one '@'
len(df[df.toaddress.str.contains(r'@', na=False)])

or even

len([x for x in df.toaddress if '@' in str(x)])
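Note that both expressions count rows, not individual '@' characters, which is the concern raised in the question. A small sketch of the difference, with a made-up Series standing in for df.toaddress:

```python
import pandas as pd

# Hypothetical stand-in for df.toaddress.
to = pd.Series(["a@example.com, b@example.com", "c@example.com", "none"])

# Rows that contain at least one '@'.
rows_with_at = int(to.str.contains("@").sum())

# Every '@' across all rows, i.e. the number of addresses.
total_ats = int(to.str.count("@").sum())

print(rows_with_at, total_ats)
```

If you need the per-field counts for min, max, and mean, str.count is the one to aggregate.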

Upvotes: 0
