Reputation: 952
I'm trying to extract a substring from a fixed starting point up to a special character, the double quote ("). This is the string:
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
The part I want to extract is the keyword, from data-keyword=" up to the next " character; in this case: aa battery plus
But I only get a single letter as the result, even when limiting the string on the left and right with the \b delimiter and square brackets.
I tried to use the re.search() method:
import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
z = re.search(r'[\bdata-keyword="\b,'""']',element).group(0)
print(z)
This is what I get:
d
Process finished with exit code 0
How do I extract only the keyword, i.e. aa batteries plus?
Upvotes: 3
Views: 525
Reputation: 4482
You can use the re.findall() function:
import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
output = re.findall(r'data-keyword="(.*?)"', element)[0]
print(output)
Output
aa battery plus
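For context on why the pattern in the question returned a single letter: square brackets define a character class, which matches exactly one character out of the listed set. The sketch below uses a shortened version of the element and a simplified character class (both are assumptions made for illustration, not the exact strings from the question) to contrast that with a literal match plus a capture group.

```python
import re

# Shortened, illustrative version of the element from the question.
element = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid=""></div>'

# Square brackets build a character class: this matches ONE character that is
# d, a, t, anything in the range a-k, e, y, w, o, r, = or ".
single = re.search(r'[data-keyword="]', element).group(0)
print(single)  # d

# Without the brackets the text is matched literally, and the parentheses
# capture everything up to the next double quote.
keyword = re.search(r'data-keyword="([^"]*)"', element).group(1)
print(keyword)  # aa battery plus
```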
Upvotes: 1
Reputation: 27723
This expression might work here, although a regular expression may not be the best way to approach this problem. If we have to use one:
data-keyword="\s*([^"]+?)\s*"
It also removes any undesired spaces before and after the desired data.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"data-keyword=\"\s*([^\"]+?)\s*\""
test_str = ("<div class=\"s-suggestion\" data-alias=\"aps\" data-crid=\"2AZHZA23OLYLF\" data-isfb=\"false\" data-issc=\"false\" data-keyword=\"aa batteries plus\" data-nid=\"\" data-reftag=\"nb_sb_ss_i_6_2\" data-store=\"\" data-type=\"a9\" id=\"issDiv5\"><span class=\"s-heavy\"></span>ab<span class=\"s-heavy\">reva cold sore treatment</span></div>\n"
"<div class=\"s-suggestion\" data-alias=\"aps\" data-crid=\"2AZHZA23OLYLF\" data-isfb=\"false\" data-issc=\"false\" data-keyword=\" aa batteries plus \" data-nid=\"\" data-reftag=\"nb_sb_ss_i_6_2\" data-store=\"\" data-type=\"a9\" id=\"issDiv5\"><span class=\"s-heavy\"></span>ab<span class=\"s-heavy\">reva cold sore treatment</span></div>")
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Match 1 was found at 105-137: data-keyword="aa batteries plus"
Group 1 found at 119-136: aa batteries plus
Match 2 was found at 417-458: data-keyword=" aa batteries plus "
Group 1 found at 435-452: aa batteries plus
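If only the captured values are needed, the same pattern could also be used with re.findall(). This is a minimal sketch; the shortened html string and its padding are made up for illustration.

```python
import re

# Same pattern as above: \s* on both sides keeps padding out of the capture group.
pattern = re.compile(r'data-keyword="\s*([^"]+?)\s*"')

# Two illustrative snippets, the second with padded attribute text.
html = (
    '<div data-keyword="aa batteries plus" data-nid=""></div>\n'
    '<div data-keyword="   aa batteries plus  " data-nid=""></div>'
)

# With a single capture group, findall returns the captured strings directly.
keywords = pattern.findall(html)
print(keywords)  # ['aa batteries plus', 'aa batteries plus']
```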
Upvotes: 2
Reputation: 82765
It is not a good idea to use regex to parse HTML. Instead, you can use an HTML parser like BeautifulSoup.
Ex:
from bs4 import BeautifulSoup
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
soup = BeautifulSoup(element, "html.parser")
print(soup.find("div", class_="s-suggestion")["data-keyword"])
Output:
aa battery plus
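If installing BeautifulSoup is not an option, the standard library's html.parser module may be enough for pulling out a single attribute. This is only a sketch: the class name KeywordCollector is hypothetical, and the element is shortened for readability.

```python
from html.parser import HTMLParser


class KeywordCollector(HTMLParser):
    """Collects the value of every data-keyword attribute seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the current start tag.
        for name, value in attrs:
            if name == "data-keyword":
                self.keywords.append(value)


# Shortened, illustrative version of the element from the question.
element = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid="" id="issDiv5">ab</div>'

collector = KeywordCollector()
collector.feed(element)
collector.close()
print(collector.keywords)  # ['aa battery plus']
```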
Upvotes: 3
Reputation: 75
While I totally agree with the previous answer, you can consider the following solution as well:
element.split('data-keyword="')[-1].split('" data-nid')[0]
This may be considered quite convenient when you need to parse "structured" inputs...
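Note that the one-liner relies on data-nid coming directly after data-keyword. A variation that does not depend on attribute order could use str.partition(); this is a sketch, and the function name keyword_by_partition is made up for illustration.

```python
def keyword_by_partition(html, marker='data-keyword="'):
    """Return the text between the marker and the next double quote, or None."""
    # partition returns (before, separator, after); separator is '' if not found.
    _, found, rest = html.partition(marker)
    if not found:
        return None
    return rest.partition('"')[0]


# Shortened, illustrative version of the element from the question.
element = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid=""></div>'
print(keyword_by_partition(element))  # aa battery plus
```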
Upvotes: 1
Reputation: 31
You do not need a regex for this.
You can simply search for the index of 'data-keyword' with the built-in string method find(substring, begin, end). Then search for the index of each of the two quotation marks that follow, and slice the text between them.
i_key = element.find('data-keyword')
i_1 = element.find('"', i_key)
i_2 = element.find('"', i_1+1)
z = element[i_1+1:i_2]
More info on the find function.
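Since find() returns -1 when the substring is missing, the steps above could be wrapped in a small function that checks for that case. This is a sketch under that assumption; the function name keyword_by_find is hypothetical.

```python
def keyword_by_find(html, attr='data-keyword'):
    """Slice out the quoted value of attr using str.find, or return None."""
    opening = attr + '="'
    i_key = html.find(opening)
    if i_key == -1:
        # Attribute not present at all.
        return None
    start = i_key + len(opening)
    end = html.find('"', start)
    if end == -1:
        # Opening quote without a closing one.
        return None
    return html[start:end]


# Shortened, illustrative version of the element from the question.
element = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid=""></div>'
print(keyword_by_find(element))  # aa battery plus
```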
Upvotes: 1
Reputation: 466
If you want the text between two strings, you'll need to use this regex format:
import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
z = re.search(r'data-keyword="(.*?)"', element).group(1)
print(z)
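One thing to keep in mind: re.search() returns None when nothing matches, so calling .group(1) directly raises AttributeError on input without the attribute. A minimal sketch of a guarded version follows; the function name extract_keyword is made up for illustration.

```python
import re
from typing import Optional

# Same pattern as above, compiled once for reuse.
KEYWORD_RE = re.compile(r'data-keyword="(.*?)"')


def extract_keyword(html: str) -> Optional[str]:
    """Return the data-keyword value, or None if the attribute is absent."""
    match = KEYWORD_RE.search(html)
    return match.group(1) if match else None


# Shortened, illustrative version of the element from the question.
element = '<div class="s-suggestion" data-keyword="aa batteries plus" data-nid=""></div>'
print(extract_keyword(element))          # aa batteries plus
print(extract_keyword('<div></div>'))    # None
```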
Upvotes: 3