Keagan McNew

Reputation: 113

How to create a regex pattern to pull a character out of a list of variously structured strings?

I am using regex to pull the letter "u" out of an address string, but only when it is used as an abbreviation (u, u., U, U., etc.). The issue I am running into is that the strings in my list are messy and full of mistakes. I have already tried to handle the variety of mistakes I have seen in the data. I know I must be missing something small, so any help is appreciated.

I have tried these regex expressions:

I also have another idea for getting around this problem, which would require pulling the addresses apart (splitting them into street, number, etc.), fixing the street part, and gluing everything back together. I have had some luck pulling just the number part out:

However, I would like to see where I am messing up in the regex expression that should be selecting the "u." Regex101.com has been my best friend with this, and I would not have made it this far without it.

import re

test_strings = [
    "Holics u 5/a",
    "Holics U 5/a",
    "Holics u5/a",
    "Huolics u 5/a",
    "Holics u. 5/a",
    "Holuics u5",
    "Holics and other stuff u more stuff after 5",
    "Houlics utca 5"
]

# two regex patterns I have considered 

print("First regex pattern ------------------------------------")
pattern = r"[^\w+][uU]"
replacement_text = " utca "

for item in test_strings:
    print(re.sub(pattern,replacement_text,item))

print("\nSecond regex pattern ------------------------------------")
pattern = r"[^\w+][uU][^tca]"
replacement_text = " utca "

for item in test_strings:
    print(re.sub(pattern,replacement_text,item))

Results from the above code:

First regex pattern:

Holics utca  5/a
Holics utca  5/a
Holics utca 5/a
Huolics utca  5/a
Holics utca . 5/a
Holuics utca 5
Holics and other stuff utca  more stuff after 5
Houlics utca tca 5 # <-------------------------------- issue

Second regex pattern:

Holics utca 5/a
Holics utca 5/a
Holics utca /a # <----------------------------------- issue
Huolics utca 5/a
Holics utca  5/a
Holuics utca  # <------------------------------------ issue
Holics and other stuff utca more stuff after 5
Houlics utca 5

With the first pattern, everything works except for the last line ("Houlics utca tca 5"). When I try to write an expression that also accounts for strings that already contain "utca", I lose the numbers in strings like "Holics u5/a".

For the most part, I expect the outcome to be:

As a final note, I have functions that remove periods and white space.

Upvotes: 1

Views: 83

Answers (1)

Wiktor Stribiżew

Reputation: 626748

You may use

re.sub(r'\b[uU](?=\b|\d)\.?\s*', 'utca ', s)

Details

  • \b - word boundary
  • [uU] - u or U
  • (?=\b|\d) - there must be a word boundary or a digit immediately to the right of the current location
  • \.? - an optional dot
  • \s* - 0+ whitespaces.
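
As a rough check, the pattern above can be run against the test strings from the question. This is only a sketch: the names U_ABBREV and expand_utca are made up for illustration and are not part of the original code.

```python
import re

# Pattern from the answer: a u/U that starts a word and is followed either by
# a word boundary or by a digit, plus an optional dot and trailing whitespace.
U_ABBREV = re.compile(r'\b[uU](?=\b|\d)\.?\s*')

def expand_utca(address: str) -> str:
    """Replace the abbreviation u / u. / U / U. with 'utca ' (hypothetical helper)."""
    return U_ABBREV.sub('utca ', address)

# Test strings copied from the question.
test_strings = [
    "Holics u 5/a",
    "Holics U 5/a",
    "Holics u5/a",
    "Huolics u 5/a",
    "Holics u. 5/a",
    "Holuics u5",
    "Holics and other stuff u more stuff after 5",
    "Houlics utca 5",
]

for s in test_strings:
    print(expand_utca(s))
```

Because the dot and the whitespace are consumed by the match itself, "u", "u." and "u5" all collapse to "utca " followed by the number, and a string that already contains "utca" is left alone.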

Alternatively, you may use

re.sub(r'\b[uU](?=\b|(?![^\W\d_]))\.?\s*', 'utca ', s)

See the regex demo and another regex demo.

Here, instead of the digit requirement, the (?![^\W\d_]) fails if the next char is a letter.
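
To see where the two variants differ, here is a small comparison. The variable names and the "Holics u_5" sample are invented for this sketch; the underscore case is used because an underscore is a word character but neither a digit nor a letter.

```python
import re

# Variant 1: the u must be followed by a word boundary or a digit.
digit_version = re.compile(r'\b[uU](?=\b|\d)\.?\s*')
# Variant 2: the u must be followed by a word boundary or by anything that is not a letter.
non_letter_version = re.compile(r'\b[uU](?=\b|(?![^\W\d_]))\.?\s*')

# The first two samples come from the question; the third is a made-up edge case.
samples = ["Holics u5/a", "Houlics utca 5", "Holics u_5"]

for s in samples:
    first = digit_version.sub('utca ', s)
    second = non_letter_version.sub('utca ', s)
    print(f"{s!r}: {first!r} | {second!r}")
```

Both variants agree on the strings from the question. They only diverge when the u is immediately followed by an underscore, which the second variant treats as a separator.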

Upvotes: 1
