saurabh

Reputation: 47

end line is not parsing correctly with re library python

Consider the string:

<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059

I want to get the result Uttam Nagar East using a regex function, but the output I'm getting is

['Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1']

I've tried using

print(re.findall(r'data-rlocation="(.*)["]',contents))

and

print(re.findall(r'data-rlocation="(.*)"',contents))

Upvotes: 0

Views: 77

Answers (5)

Emma

Reputation: 27763

find_all from bs4 (BeautifulSoup) might return the desired output without any regex:

from bs4 import BeautifulSoup

line = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
soup = BeautifulSoup(line, 'html.parser')

for l in soup.find_all('p'):
    print(l['data-rlocation'])

Output

Uttam Nagar East
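If installing bs4 is not an option, a similar idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the bs4 approach above, and the class name RLocationCollector is a hypothetical helper introduced only for illustration:

```python
from html.parser import HTMLParser

# Sample input copied from the question.
line = (
    '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi '
    '<span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,'
    '<br>Delhi<span> - </span>110059'
)


class RLocationCollector(HTMLParser):
    """Hypothetical helper: collects every data-rlocation attribute value."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        for name, value in attrs:
            if name == "data-rlocation":
                self.locations.append(value)


collector = RLocationCollector()
collector.feed(line)
collector.close()

print(collector.locations)
```

For the sample line this should print a list holding the single value Uttam Nagar East, the same value the bs4 loop prints.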

If not, maybe

(?i)data-rlocation="([^\r\n"]*)"

with re.findall might be another option.

import re

expression = r'(?i)data-rlocation="([^\r\n"]*)"'

string = """
<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059
"""

print(re.findall(expression, string))

Output

['Uttam Nagar East']

If you wish to explore, simplify, or modify the expression, regex101.com explains it in its top-right panel. You can also see there how it matches against sample inputs.


Upvotes: 0

Aleksandar

Reputation: 1776

A positive lookbehind and positive lookahead with a lazy match will do the trick.

Pattern: (?<=data-rlocation=").*?(?=")

Code: print(re.findall(r'(?<=data-rlocation=").*?(?=")',contents))

Demo on regex101

Explanation

  • (?<= open a positive lookbehind. It is not part of the returned string. It only makes sure that this pattern sits right before the match.
    • data-rlocation=" this is the string that must precede the match
  • ) close the positive lookbehind
  • .* match any characters (except a newline) of the string we want to return
  • ? make the * lazy (not greedy), so it stops at the first closing quote
  • (?= open a positive lookahead to match the closing pattern but don't return the string
    • " match the next double quote
  • ) close the positive lookahead
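The pieces explained above can be put together in a short runnable sketch. The contents variable is assumed to hold the line from the question, and the names lookaround_pattern and lookaround_matches are only illustrative:

```python
import re

# Sample input copied from the question.
contents = (
    '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi '
    '<span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,'
    '<br>Delhi<span> - </span>110059'
)

# The lookbehind anchors the match right after data-rlocation=" and the lazy
# .*? stops at the first position where the lookahead sees a double quote.
lookaround_pattern = re.compile(r'(?<=data-rlocation=").*?(?=")')

# There is no capture group, so findall returns the full matches themselves.
lookaround_matches = lookaround_pattern.findall(contents)
print(lookaround_matches)  # ['Uttam Nagar East']
```

Because neither lookaround consumes characters, the quotes and the attribute name never end up in the result, so no capture group is needed.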

Upvotes: 1

Dev Khadka

Reputation: 5481

You are using a greedy regex. Add '?' after the '*' to make it non-greedy:

import re
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
print(re.findall(r'data-rlocation="(.*?)"',contents))

Upvotes: 1

mackorone

Reputation: 1066

By default, * is greedy, which means that it tries to consume as many characters as possible. If you'd rather match as few characters as possible, you can use the non-greedy qualifier *? instead:

print(re.findall(r'data-rlocation="(.*?)"',contents))

More information: https://docs.python.org/3.5/howto/regex.html#greedy-versus-non-greedy
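To see the difference side by side, here is a small sketch that runs both patterns on the line from the question. The variable names greedy and lazy are just for illustration:

```python
import re

# Sample input copied from the question.
contents = (
    '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi '
    '<span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,'
    '<br>Delhi<span> - </span>110059'
)

# Greedy: .* runs to the end of the line, then backtracks to the LAST double quote.
greedy = re.findall(r'data-rlocation="(.*)"', contents)

# Non-greedy: .*? expands one character at a time and stops at the FIRST double quote.
lazy = re.findall(r'data-rlocation="(.*?)"', contents)

print(greedy)  # ['Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1']
print(lazy)    # ['Uttam Nagar East']
```

The greedy result is the unwanted output reported in the question. The only change needed is the extra ? after the *.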

Upvotes: 1

Zach Gates

Reputation: 4155

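The group (.*) is greedy, so it runs past the first closing quote and captures everything up to the last double quote in the string. Try this instead, which only lets the group match characters that are not a double quote: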

>>> re.findall(r'data-rlocation="([^"]*)"', contents)
['Uttam Nagar East']

Check out how it works here.
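If only a single value is needed, the same pattern can be wrapped with re.search. This is a minimal sketch, and extract_rlocation is a hypothetical helper name, not something from the re module:

```python
import re

# [^"]* matches any run of characters that are not a double quote, so the
# group cannot cross the closing quote of the attribute value.
RLOCATION_RE = re.compile(r'data-rlocation="([^"]*)"')


def extract_rlocation(html):
    """Hypothetical helper: return the first data-rlocation value, or None."""
    match = RLOCATION_RE.search(html)
    return match.group(1) if match else None


# Sample input copied from the question.
contents = (
    '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi '
    '<span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,'
    '<br>Delhi<span> - </span>110059'
)

print(extract_rlocation(contents))            # Uttam Nagar East
print(extract_rlocation("<p>no attribute</p>"))  # None
```

Unlike . the negated class also matches newlines, so this version keeps working if an attribute value happens to span more than one line.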

Upvotes: 3
