Reputation: 47
Consider the string:
<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059
I want to extract Uttam Nagar East (the value of the data-rlocation attribute)
using a regex function, but the output I'm getting is
Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1'
I've tried using
print(re.findall(r'data-rlocation="(.*)["]',contents))
and
print(re.findall(r'data-rlocation="(.*)"',contents))
Upvotes: 0
Views: 77
Reputation: 27763
Maybe find_all from bs4 (BeautifulSoup) might return the desired output:
from bs4 import BeautifulSoup
line = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
soup = BeautifulSoup(line, 'html.parser')
for l in soup.find_all('p'):
    print(l['data-rlocation'])
Uttam Nagar East
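If bs4 is not installed, the same parse-instead-of-regex idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name RLocationCollector and its locations attribute are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class RLocationCollector(HTMLParser):
    """Hypothetical helper: collects the data-rlocation value of every start tag that has one."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute names arrive lower-cased
        for name, value in attrs:
            if name == "data-rlocation":
                self.locations.append(value)


line = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

collector = RLocationCollector()
collector.feed(line)
collector.close()
print(collector.locations)  # ['Uttam Nagar East']
```

Unlike the bs4 loop above, this variant collects the attribute from any tag, not only from p elements.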
If not, maybe the expression (?i)data-rlocation="([^\r\n"]*)" with re.findall might be another option.
import re
expression = r'(?i)data-rlocation="([^\r\n"]*)"'
string = """
<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059
"""
print(re.findall(expression, string))
['Uttam Nagar East']
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
Upvotes: 0
Reputation: 1776
A positive lookbehind and positive lookahead with a lazy match will do the trick.
Pattern: (?<=data-rlocation=").*?(?=")
Code: print(re.findall(r'(?<=data-rlocation=").*?(?=")',contents))
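The one-liner above can be tried as a self-contained sketch. The contents value here is the sample string from the question; because the pattern has no capturing group, re.findall returns the whole match, and the lookarounds keep the attribute name and the quotes out of it.

```python
import re

# Sample input copied from the question
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

# Lookbehind and lookahead assert the surrounding text without consuming it
pattern = re.compile(r'(?<=data-rlocation=").*?(?=")')

matches = pattern.findall(contents)
print(matches)  # ['Uttam Nagar East']
```

Note that Python's re module only accepts fixed-width lookbehinds, which the literal data-rlocation=" satisfies.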
Explanation
(?<= - open a positive lookbehind. It is not part of the returned string; it only makes sure that this pattern appears right before the match.
data-rlocation=" - the literal string to match
) - close the positive lookbehind
.* - match every character of the string we want to return
? - make the * lazy (not greedy)
(?= - open a positive lookahead to match the closing pattern without returning it
" - match the next double quote
) - close the positive lookahead
Upvotes: 1
Reputation: 5481
You are using a greedy regex. You can add '?' after the '*' to make it non-greedy:
import re
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'
print(re.findall(r'data-rlocation="(.*?)"',contents))
Upvotes: 1
Reputation: 1066
By default, * is greedy, which means that it tries to consume as many characters as possible. If you'd rather match as few characters as possible, you can use the non-greedy qualifier *? instead:
print(re.findall(r'data-rlocation="(.*?)"',contents))
More information: https://docs.python.org/3.5/howto/regex.html#greedy-versus-non-greedy
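To see the difference side by side, here is a small sketch that runs both quantifiers on the sample string from the question (the variable names greedy and lazy are just for illustration):

```python
import re

# Sample input copied from the question
contents = '<p class="sm clg" data-rlocation="Uttam Nagar East">Uttam Nagar East, Delhi <span class="to-txt" id="citytt1">B-24, East Uttam Nagar, Uttam Nagar East,<br>Delhi<span> - </span>110059'

# Greedy: .* runs to the end of the line, then backtracks to the LAST double quote
greedy = re.findall(r'data-rlocation="(.*)"', contents)

# Lazy: .*? stops at the FIRST double quote that lets the pattern succeed
lazy = re.findall(r'data-rlocation="(.*?)"', contents)

print(greedy)  # one long capture ending in ...id="citytt1
print(lazy)    # ['Uttam Nagar East']
```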
Upvotes: 1
Reputation: 4155
The group (.*) is greedy, so it runs past the closing quote and captures everything up to the last double quote on the line. Try a negated character class instead:
>>> re.findall(r'data-rlocation="([^"]*)"', contents)
['Uttam Nagar East']
Check out how it works here.
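As a quick sketch of why the negated class is robust: [^"]* can never cross a double quote, so each match stays inside its own attribute value even when several attributes share a line. The html string below is made-up sample data (the "Dwarka" and empty values are not from the question).

```python
import re

# Hypothetical input with several data-rlocation attributes on one line
html = '<p data-rlocation="Uttam Nagar East">a</p><p data-rlocation="Dwarka">b</p><p data-rlocation="">c</p>'

# [^"]* matches zero or more characters that are not a double quote
values = re.findall(r'data-rlocation="([^"]*)"', html)
print(values)  # ['Uttam Nagar East', 'Dwarka', '']
```

Because * allows zero repetitions, an empty attribute value is returned as an empty string rather than being skipped.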
Upvotes: 3