Pro Girl

Reputation: 952

Extracting a substring with a regex in Python

I'm trying to extract a substring that starts at a fixed point and ends at the next " character. This is the string:

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'

The part I want to extract is the keyword, from data-keyword=" up to the next " character, so in this case: aa battery plus

But the result is just a single letter, even when I limit the string on the left and right with the \b delimiter and square brackets.

I tried to use the re.search() function:

import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
z = re.search(r'[\bdata-keyword="\b,'""']',element).group(0)
print(z)

This is what I get:

d
Process finished with exit code 0

How do I extract only the keyword, i.e. aa batteries plus?

Upvotes: 3

Views: 525

Answers (6)

Sebastien D

Reputation: 4482

You can use the re.findall() function with a capture group:

import re
element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
output = re.findall(r'data-keyword="(.*?)"', element)[0]
print(output)

Output

aa battery plus
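
One caveat with indexing the result directly: re.findall() returns an empty list when the attribute is absent, so [0] would raise IndexError. Below is a minimal sketch of a guarded variant; the helper name extract_keywords and the shortened sample string are assumptions made for illustration, not part of the original answer.

```python
import re

# Hypothetical helper: compile the pattern once and reuse it.
KEYWORD_RE = re.compile(r'data-keyword="(.*?)"')

def extract_keywords(html):
    """Return every data-keyword value found in html (possibly an empty list)."""
    return KEYWORD_RE.findall(html)

# Shortened sample, assumed for illustration.
sample = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid=""></div>'

found = extract_keywords(sample)
# Take the first value only if there is one, instead of indexing blindly.
first = found[0] if found else None
print(first)
```

With this shape the caller decides what a missing attribute means, rather than catching an exception.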

Upvotes: 1

Emma

Reputation: 27723

This expression is likely to work here, although a regex may not be the best way to approach this problem. If we have to use one:

data-keyword="\s*([^"]+?)\s*"

It also strips any undesired spaces before and after the desired data.

TEST

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"data-keyword=\"\s*([^\"]+?)\s*\""

test_str = ("<div class=\"s-suggestion\" data-alias=\"aps\" data-crid=\"2AZHZA23OLYLF\" data-isfb=\"false\" data-issc=\"false\" data-keyword=\"aa batteries plus\" data-nid=\"\" data-reftag=\"nb_sb_ss_i_6_2\" data-store=\"\" data-type=\"a9\" id=\"issDiv5\"><span class=\"s-heavy\"></span>ab<span class=\"s-heavy\">reva cold sore treatment</span></div>\n"
    "<div class=\"s-suggestion\" data-alias=\"aps\" data-crid=\"2AZHZA23OLYLF\" data-isfb=\"false\" data-issc=\"false\" data-keyword=\"    aa batteries plus     \" data-nid=\"\" data-reftag=\"nb_sb_ss_i_6_2\" data-store=\"\" data-type=\"a9\" id=\"issDiv5\"><span class=\"s-heavy\"></span>ab<span class=\"s-heavy\">reva cold sore treatment</span></div>")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Output

Match 1 was found at 105-137: data-keyword="aa batteries plus"
Group 1 found at 119-136: aa batteries plus
Match 2 was found at 417-458: data-keyword="    aa batteries plus     "
Group 1 found at 435-452: aa batteries plus

Upvotes: 2

Rakesh

Reputation: 82765

It is not a good idea to use a regex to parse HTML. Instead, you can use an HTML parser like BeautifulSoup.

Example:

from bs4 import BeautifulSoup

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa battery plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'
soup = BeautifulSoup(element, "html.parser")
print(soup.find("div", class_="s-suggestion")["data-keyword"])

Output:

aa battery plus
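
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This swaps in a different technique from the one above; the class name KeywordParser and the shortened sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser

class KeywordParser(HTMLParser):
    """Hypothetical helper: collect data-keyword from every start tag that has it."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name == "data-keyword":
                self.keywords.append(value)

# Shortened sample, assumed for illustration.
element = '<div class="s-suggestion" data-keyword="aa battery plus" data-nid=""><span class="s-heavy"></span>ab</div>'

parser = KeywordParser()
parser.feed(element)
print(parser.keywords)
```

The list stays empty when no tag carries the attribute, so a missing keyword does not raise.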

Upvotes: 3

mcchran

Reputation: 75

While I totally agree with the previous answer, you can also consider the following solution:

element.split('data-keyword="')[-1].split('" data-nid')[0]

This can be quite convenient when you need to parse "structured" input, though it assumes that data-nid immediately follows data-keyword.
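
A variant of the same idea that does not depend on which attribute comes next can be sketched with str.partition, which stops at the closing quote instead. Note that it takes the first occurrence of the marker, whereas split(...)[-1] takes the last. The function name keyword_by_partition and the shortened sample are assumptions made for illustration.

```python
def keyword_by_partition(html, marker='data-keyword="'):
    """Hypothetical helper: return the text between marker and the next quote."""
    # partition returns (before, separator, after); separator is '' if not found.
    _, sep, rest = html.partition(marker)
    if not sep:
        return None
    value, quote, _ = rest.partition('"')
    # Without a closing quote the attribute is malformed, so report nothing.
    return value if quote else None

# Shortened sample, assumed for illustration.
sample = '<div data-isfb="false" data-keyword="aa battery plus" data-reftag="nb_sb_ss_i_6_2"></div>'
print(keyword_by_partition(sample))
```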

Upvotes: 1

DaWhiteSheep

Reputation: 31

You do not need a regex for this. You can simply search for the index of 'data-keyword' with the string method find(substring, begin, end). Then search for the indices of the two quotation marks that follow and slice the text between them.

i_key = element.find('data-keyword')
i_1 = element.find('"', i_key)
i_2 = element.find('"', i_1+1)
z = element[i_1+1:i_2]

More info on the find function.
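
Because str.find() returns -1 when the substring is missing, the slicing above can silently return unrelated text for input that lacks the attribute. A minimal sketch of a guarded version follows; the function name keyword_by_find and the shortened sample are assumptions made for illustration.

```python
def keyword_by_find(html, attr='data-keyword'):
    """Hypothetical helper: slice out the quoted value that follows attr."""
    i_key = html.find(attr)
    if i_key == -1:
        return None  # attribute not present
    i_open = html.find('"', i_key)
    if i_open == -1:
        return None  # no opening quote after the attribute name
    i_close = html.find('"', i_open + 1)
    if i_close == -1:
        return None  # no closing quote
    return html[i_open + 1:i_close]

# Shortened sample, assumed for illustration.
sample = '<div data-issc="false" data-keyword="aa battery plus" data-nid=""></div>'
print(keyword_by_find(sample))
```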

Upvotes: 1

Salman Farsi

Reputation: 466

If you want the text between two strings, you'll need a regex of this form, with a capture group between them:

import re

element = '<div class="s-suggestion" data-alias="aps" data-crid="2AZHZA23OLYLF" data-isfb="false" data-issc="false" data-keyword="aa batteries plus" data-nid="" data-reftag="nb_sb_ss_i_6_2" data-store="" data-type="a9" id="issDiv5"><span class="s-heavy"></span>ab<span class="s-heavy">reva cold sore treatment</span></div>'

z = re.search(r'data-keyword="(.*?)"', element).group(1)
print(z)
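
For context on why the pattern in the question returned a single letter: square brackets define a character class, so the whole expression matches exactly one character from that set (inside a class, \b means backspace and a-k is a range). The sketch below reproduces this on a shortened sample string, which is an assumption made for illustration, and also guards against re.search() returning None.

```python
import re

# Shortened sample, assumed for illustration.
element = '<div class="s-suggestion" data-keyword="aa batteries plus" data-nid=""></div>'

# Character class: matches ONE character out of the listed set.
# The first such character in the string is the "d" of "<div".
char_class = re.search(r'[\bdata-keyword=",]', element)
print(char_class.group(0))

# Literal prefix plus a capture group: matches the attribute and captures its value.
match = re.search(r'data-keyword="([^"]*)"', element)
# re.search returns None when nothing matches, so check before calling group().
keyword = match.group(1) if match else None
print(keyword)
```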

Upvotes: 3
