ds_user

Reputation: 2179

python - regex - need string if there is no comma and second space in it

I have written this function:

import re

def my_func(s):
    wordlist = ('unit', 'room', 'lot')
    # Proceed only if s starts with a word from wordlist and contains a digit
    if any(re.match(r'^' + word + r'\b.*$', s.lower()) for word in wordlist) and any(i.isdigit() for i in s.lower()):
        if ',' in s:
            out = re.findall(r"(.*),", s)  # Getting everything before the comma
            return out[0]
        else:
            out = re.findall(r"([^\s]*\s[^\s]*)", s)  # Getting everything before the second space
            return out[0]

My test data and the expected output:

Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> None

My logic here is:

  1. Take everything up to the first comma, if there is a ',' in the string.
  2. Take everything up to the second space if there is no comma.
  3. Don't return anything if the string does not start with a word from the wordlist.
  4. Return the whole string if there is no second space or comma in it.

Everything else is working fine. How do I capture Lot 12 here? That is, if the string matches the wordlist and has no ',' and no second space, return all of it.

Upvotes: 0

Views: 664

Answers (1)

zwer

Reputation: 25799

You're overcomplicating this; it's a simple word + whitespace + digits match:

import re

def my_func(s):
    wordlist = ('unit', 'room', 'lot') 
    result = re.match(r"((?:{})\s+\d+)".format("|".join(wordlist)), s, re.IGNORECASE)
    if result:
        return result.group()

Let's test it:

test_data = ["Unity 11 Lane.",
             "Unit 11 queen street",
             "Unit 7, king street",
             "Lot 12",
             "Unit street"]

for entry in test_data:
    print("{} --> {}".format(entry, my_func(entry)))

Which gives:

Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> None
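
To see why "Unity 11 Lane." is rejected, it can help to print the pattern that the format call builds. The snippet below is only a sketch: the re.escape call is an added precaution that is not part of the function above (it has no effect on these three words), and the names pattern and compiled are made up for illustration.

```python
import re

wordlist = ('unit', 'room', 'lot')

# re.escape is an added precaution, not in the answer's function; it only
# matters if a keyword ever contains a regex metacharacter.
pattern = r"((?:{})\s+\d+)".format("|".join(re.escape(w) for w in wordlist))
print(pattern)  # ((?:unit|room|lot)\s+\d+)

compiled = re.compile(pattern, re.IGNORECASE)

# After "Unit" the pattern demands whitespace, so the "y" in "Unity" fails it.
print(compiled.match("Unity 11 Lane."))   # None
print(compiled.match("Lot 12").group())   # Lot 12
```

Because \s+ must come directly after the keyword, no explicit \b is needed to keep "Unity" out.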

If you really want to match everything before a whitespace, a comma or EOL, you can do it by replacing the regex with:

result = re.match(r"((?:{})\s+.+?(?=\s|,|$))".format("|".join(wordlist)), s, re.IGNORECASE)

But this will also match one of your undesired strings, because the pattern cannot know that you want tokens such as 11 and 12 but don't want street:

Unity 11 Lane. --> None
Unit 11 queen street --> Unit 11
Unit 7, king street --> Unit 7
Lot 12 --> Lot 12
Unit street --> Unit street
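
If the rule you actually want is "the token after the keyword must contain at least one digit" (an assumption based on the isdigit check in the question), one possible middle ground is sketched below. The name my_func_digit_token is hypothetical, and this is not the only way to express the rule.

```python
import re

WORDLIST = ('unit', 'room', 'lot')

def my_func_digit_token(s):
    # Hypothetical variant: keyword, whitespace, then a single token that
    # contains at least one digit and ends at whitespace, a comma or EOL.
    pattern = r"(?:{})\s+[^\s,]*\d[^\s,]*".format("|".join(WORDLIST))
    result = re.match(pattern, s, re.IGNORECASE)
    if result:
        return result.group()
    return None

test_data = ["Unity 11 Lane.",
             "Unit 11 queen street",
             "Unit 7, king street",
             "Lot 12",
             "Unit street"]

for entry in test_data:
    print("{} --> {}".format(entry, my_func_digit_token(entry)))
```

Unlike the strict \d+ version, this also accepts tokens that mix letters and digits, such as "Unit 7A", while still returning None for "Unit street".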

Upvotes: 1
