How to extract only one string from regex in Python?

Question

I have been trying to build a simple account manager sort of application for myself using Python which will read SMS from my phone and extract information based on some regex patterns.

I wrote a complex regex pattern and tested the same on https://pythex.org/. Example:

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

However, when I try to do the same in Python using the str.extract() method, rather than getting a single result, I am getting a dataframe with a column for each group.

Python code looks like this:

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10

I am very new to Python and trying to learn through hands-on experience. It would be really helpful if someone could point out where I am going wrong.

fmv1992 · Accepted Answer

Before diving into your regex pattern, you should ask yourself why you are using pandas. Pandas is built for data analysis (so it does fit your problem), but it seems like overkill here.

If you are a beginner, I advise you to stick with pure Python, not because pandas is complicated but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are going to use Python 3 (without pandas), I would proceed as follows:

# Needed imports from standard library.
import csv
import re

# Declare the constants of my tiny program.
# Raw string, so the backslashes reach the regex engine unchanged.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

# This list will store the matched strings.
found_regexes = list()

# Do the necessary loading to enable searching for the regex.
# The default comma delimiter is used; a space delimiter would split
# each message into single words.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, quotechar='"')
    # Iterate over rows in your csv file; each row is a list of fields.
    for row in csv_reader:
        for field in row:
            match = COMPILED_REGEX.search(field)
            if match:
                # group(0) is the whole match, not the individual groups.
                found_regexes.append(match.group(0))

print(found_regexes)

This will not necessarily solve your problem by copy-paste, but it might give you an idea of a simpler approach.
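As for getting a single string instead of one value per group: a match object from `re` exposes the whole matched text as `group(0)`, while `group(1)`, `group(2)`, and so on are the individual capture groups, which is what pandas spreads across columns. Below is a minimal sketch using the sample message from the question. The helper name `extract_full_match` is made up for illustration, and the `.strip()` call is my own addition, because the pattern also consumes the spaces that follow the account number.

```python
import re

# Pattern copied from the question, written as a raw string.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)


def extract_full_match(text):
    """Return the whole matched substring, or None when nothing matches.

    Hypothetical helper, not part of the original code.
    """
    match = COMPILED_REGEX.search(text)
    if match is None:
        return None
    # group(0) is the entire match; the numbered groups are the pieces
    # that show up as separate columns in the pandas output.
    return match.group(0).strip()


sample = '1.00 is debited from ******1234  for food'
print(extract_full_match(sample))  # from ******1234

# A single group can still be picked by number when only that piece is
# needed; group 6 corresponds to column 5 in the pandas output above.
print(COMPILED_REGEX.search(sample).group(6))  # ******1234
```

If you do stay with pandas, wrapping the entire pattern in one extra pair of parentheses and taking column 0 of the `str.extract` result should give the same whole-match string, since column 0 then corresponds to the outermost group. I have not tested that against your data.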
