Reputation: 43
I have a dataframe (df) containing columns ['toaddress', 'ccaddress', 'body']
I want to iterate through the index of the dataframe to get the min, max, and average number of email addresses in the toaddress and ccaddress fields, as determined by counting the instances of '@' within each field in those two columns.
If all else fails, I guess I could just use df.toaddress.str.contains(r'@').sum() and divide that by the number of rows in the dataframe to get the average, but I think that only counts the rows that have at least one '@' sign.
Upvotes: 4
Views: 10341
Reputation: 403
This answer uses https://pypi.python.org/pypi/fake-factory to generate the test data
import pandas as pd
from random import randint
from faker import Factory

fake = Factory.create()

def emails():
    # Return a list of 1 to 4 fake email addresses
    emailAdd = [fake.email()]
    for x in range(randint(0, 3)):
        emailAdd.append(fake.email())
    return emailAdd

# Build 10 rows of test data; each address cell holds a list of emails.
# (DataFrame.append was removed in pandas 2.0, so the rows are collected first.)
rows = [{'toaddress': emails(), 'ccaddress': emails(), 'body': fake.text()}
        for extra in range(10)]
df1 = pd.DataFrame(rows, columns=['toaddress', 'ccaddress', 'body'])

# Count the addresses in each cell
print('toaddress length is {}'.format([len(x) for x in df1.toaddress.values]))
print('ccaddress length is {}'.format([len(x) for x in df1.ccaddress.values]))
The last 2 lines are the part that counts your emails. I wasn't sure if you wanted to check for '@' specifically; maybe you can use fake-factory to generate some test data as an example?
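If the cells hold lists of addresses as in the test data above, the per-cell lengths can be summarized directly. This is only a rough sketch: the two-row frame and the example.com addresses are made up for illustration and stand in for the faker-generated data.

```python
import pandas as pd

# Hypothetical stand-in for the generated data: each cell is a list of addresses
df1 = pd.DataFrame({
    'toaddress': [['a@example.com'],
                  ['b@example.com', 'c@example.com', 'd@example.com']],
    'ccaddress': [['e@example.com', 'f@example.com'],
                  ['g@example.com']],
})

# Length of the list in every cell, column by column
lengths = df1[['toaddress', 'ccaddress']].apply(lambda col: col.map(len))

# One row each for min, max and mean; one column per address field
summary = lengths.agg(['min', 'max', 'mean'])
print(summary)
```

This only applies when each cell is a list; if the addresses are stored as one delimited string per cell, counting '@' is the more direct route.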
Upvotes: 0
Reputation: 77404
You can use
df[['toaddress', 'ccaddress']].applymap(lambda x: str.count(x, '@'))
to get back the count of '@' within each cell.
Then you can just compute the pandas max, min, and mean along the row axis in the result.
As I commented on the original question, you already suggested using df.toaddress.str.contains(r'@').sum(), so why not use df.toaddress.str.count(r'@') if you're happy going column by column instead of the method I showed above?
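To make the difference concrete, here is a small sketch of the column-by-column variant. The three-row frame and its example.com addresses are invented for illustration, and it assumes each cell is a plain string of comma-separated addresses.

```python
import pandas as pd

# Hypothetical sample data: comma-separated address strings, some empty
df = pd.DataFrame({
    'toaddress': ['a@example.com, b@example.com', 'c@example.com', ''],
    'ccaddress': ['d@example.com', '',
                  'e@example.com, f@example.com, g@example.com'],
    'body': ['hi', 'hello', 'hey'],
})

# str.contains flags rows with at least one '@', so its sum counts rows
rows_with_at = df.toaddress.str.contains(r'@').sum()

# str.count returns the number of '@' characters in each cell
counts = df[['toaddress', 'ccaddress']].apply(lambda col: col.str.count('@'))

# min, max and mean of the per-cell counts for each column
summary = counts.agg(['min', 'max', 'mean'])
print(rows_with_at)
print(summary)
```

Here toaddress has 3 addresses spread over 2 non-empty rows, so dividing the contains-based sum by the row count would understate the average, which is the concern raised in the question.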
Upvotes: 3
Reputation: 427
Perhaps something like this
import re
import pandas as pd

df = pd.DataFrame({"emails": ["[email protected], [email protected]",
                              "[email protected], none, [email protected], [email protected]"]})

# Pattern matching a single '@' character
at = re.compile(r"@", re.I)

def count_emails(string):
    # Count every '@' occurrence in the string
    count = 0
    for i in at.finditer(string):
        count += 1
    return count

df["count"] = df["emails"].map(count_emails)
df
Returns:
emails count
0 "[email protected], [email protected]" 2
1 "[email protected], none, [email protected], Th..." 3
Upvotes: 0
Reputation: 2627
# Number of rows whose toaddress contains at least one '@'
len(df[df.toaddress.str.contains(r'@')])
or even
len(list(filter(lambda row: r'@' in str(row.toaddress), df.itertuples())))
Upvotes: 0