Reputation: 459
I'm trying to
-match the number of times a dollar/cent/percent appears in a column in a data frame -create a function that lists all of these matches -create a function that count the number of matches
I'm able to successfully use regex to find the matches, but I'm unable to get a list to appear and count them. I tried using str.extract()
to list out all of the matches and str.contains().sum()
to count the number of matches. However, I can't get these to work at the same time and I'm not sure if this is even the right approach.
Here's what I have so far:
(?:^|\s)(\$?(?:\d[ ,.]?)+%?)(?=\s|$)
This will match any dollar/cent or percent, and accounts for a bunch of different typos that I'm trying to include so the regex string is fine.
import re
import pandas as pd
sample_csv = pd.DataFrame({'SMS_text': ['$1', '214-233-4455'],
'SMS_number': ['1', '1'],
'Amount': ['12.12%', '$10.12'],
'Percent': ['10%', '1']})
sample_csv.Amount.str.extract(r'(\$?(?:\d[ ,.]?)+%?)(?=\s|$)')
This will list out the matches from the Amount column.
0 12.12%
1 $10.12
Name: Amount, dtype: object
And if I substitute str.extract()
with sample_csv.Amount.str.contains(r'(\$?(?:\d[ ,.]?)+%?)(?=\s|$)').sum()
it will tell me how many matches there are (i.e. 2) but again, I can't get them to work at the same time.
Is this the right approach or am I setting this up wrong? Seems like maybe a findall()
or maybe an re.compile()
might work better, but I'm not sure how to get these to work properly.
My desired output from above would be something like:
regex_number_matches = ['12.12%', '$10.12']
regex_number_matches_count = 2
Upvotes: 1
Views: 369
Reputation: 210912
It seems (to me) you want to find those values in the Amount
column that can't be converted to numbers:
Assuming you have the following DF:
In [107]: sample_csv
Out[107]:
Amount Percent SMS_number SMS_text
0 12.12% 10% 1 $1
1 $10.12 1 1 214-233-4455
2 1.23 2 1 214-233-4455
3 33,5 3 1 214-233-4455
first create a boolean mask of those rows where Amount can't be converted to numeric value:
In [108]: mask = pd.to_numeric(sample_csv['Amount'], errors='coerce').isnull()
In [109]: mask
Out[109]:
0 True
1 True
2 False
3 True
Name: Amount, dtype: bool
now you can easily filter DF, using this mask:
In [110]: sample_csv.loc[mask, 'Amount']
Out[110]:
0 12.12%
1 $10.12
3 33,5
Name: Amount, dtype: object
In [111]: sample_csv.loc[mask, 'Amount'].tolist()
Out[111]: ['12.12%', '$10.12', '33,5']
In [112]: len(sample_csv.loc[mask, 'Amount'])
Out[112]: 3
Upvotes: 1
Reputation: 38415
If you want total number of times % or $ appears in your dataframe, try
sample_csv.stack().str.contains('%|\$').sum()
It returns 4
If you want all the instance of the cells in which those appear, try
sample_csv[sample_csv.stack().str.contains('%|\$').unstack()]
And you get
Amount Percent SMS_number SMS_text
0 12.12% 10% NaN $1
1 $10.12 NaN NaN NaN
Upvotes: 2