hanan

Reputation: 632

Extracting a string between two strings from HTML string

I have the following Python code and want to print each user together with its number. I tried this regex:

import re


txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n       
   <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''


regexp = re.findall(
            r"add_recipient\(([0-9]+)\)\" success=.+>([a-zA-Z0-9\w]+) ", txt)

for x in regexp:
    print(x[1],  x[0])

Running the above code prints:

user 0000000
user 1111111
user 2222222
user 3333333
user 4444444
user 5555555
user 6666666
user 7777777
user 8888888

I need the output to be:

user Zero 0000000
user One 1111111
...

How can I get this output? In some cases re.findall returns only user 8888888 and I don't know why. How can I get the full match?

Upvotes: 0

Views: 73

Answers (3)

mozway

Reputation: 260530

Using regex alone to parse XML/HTML is bad practice. Use a parser for the markup, with a bit of regex help for the attribute value:

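from bs4 import BeautifulSoup
import re

# Name the parser explicitly; omitting it makes bs4 emit a warning.
soup = BeautifulSoup(txt, 'html.parser')

out = []
# html.parser lowercases attribute names, so onClick is found as onclick.
for e in soup.find_all('a', onclick=True):
    # Raw string so that \( and \) reach the regex engine unchanged.
    m = re.search(r'(?<=add_recipient\().*(?=\))', e['onclick'])
    if m:
        a = m.group()
        out.append((e.contents[0], a))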

output:

[('user zero M ...', '0000000'),
 ('user One S ...', '1111111'),
 ('user Two A ...', '2222222'),
 ('user three H ...', '3333333'),
 ('user four M ...', '4444444'),
 ('user Five O ...', '5555555'),
 ('user six F ...', '6666666'),
 ('user Seven Mo ...', '7777777'),
 ('user eight ...', '8888888'),
 ('ِuser nine M ...', '9999999')]

Alternative output (only the first two words of the name): replace the last line with:

out.append((' '.join(e.contents[0].split(maxsplit=2)[:2]), a))

output:

[('user zero', '0000000'),
 ('user One', '1111111'),
 ('user Two', '2222222'),
 ('user three', '3333333'),
 ('user four', '4444444'),
 ('user Five', '5555555'),
 ('user six', '6666666'),
 ('user Seven', '7777777'),
 ('user eight', '8888888'),
 ('ِuser nine', '9999999')]
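
If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is only a sketch: the class name RecipientParser and the shortened sample string are illustrative assumptions, not part of the code above.

```python
from html.parser import HTMLParser
import re


class RecipientParser(HTMLParser):
    """Collect (name, id) pairs from <a onclick="add_recipient(N)"> links."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending_id = None  # id of the <a> whose text is still expected

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # HTMLParser lowercases attribute names, so onClick arrives as onclick.
        onclick = dict(attrs).get('onclick') or ''
        m = re.fullmatch(r'add_recipient\((\d+)\)', onclick)
        self._pending_id = m.group(1) if m else None

    def handle_data(self, data):
        # The first non-blank text after a matching <a> is taken as the name.
        if self._pending_id is not None and data.strip():
            self.pairs.append((data.strip(), self._pending_id))
            self._pending_id = None


# Shortened, hypothetical sample in the same shape as the question's markup.
sample = ('<div><a href="#" class="individual" onClick="add_recipient(1111111)">'
          'user One S ...<span> Add </span></a></div>'
          '<div><a href="#" onClick="add_all_recipient(\'1111111\')">Select All</a></div>')

parser = RecipientParser()
parser.feed(sample)
print(parser.pairs)  # [('user One S ...', '1111111')]
```

The "Select All" link is skipped because its onclick value does not fully match add_recipient(...).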

Upvotes: 2

Corralien

Reputation: 120409

I'm not a regex expert, but you can try:

out = re.findall(r"add_recipient\(([0-9]+)\)\" success=.+>(\w+\s+\w+)", txt)
print(*[' '.join(i[::-1]) for i in out], sep='\n')

# Output
user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
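
Note that 9999999 is missing from this output. In the question's string the last name starts with \u0650 (an Arabic combining mark), and \w does not match it, so the pattern finds nothing right after the >. A minimal sketch of one way around that, assuming it is acceptable to skip any leading non-word characters; the one-line sample and the names strict and lenient are made up for illustration:

```python
import re
import unicodedata

# One-line sample modelled on the question's last entry (name starts with U+0650).
sample = ('onClick="add_recipient(9999999)" success="Element.hide(\'loader\')">'
          '\u0650user nine M ...<span>')

print(unicodedata.category('\u0650'))  # Mn (nonspacing mark)
print(re.match(r'\w', '\u0650'))       # None: \w does not match the mark

# Same idea as the pattern above, but the name must start right after ">".
strict = r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+\s+\w+)'
# Variant that first skips non-word characters (but never "<" or ">").
lenient = r'add_recipient\(([0-9]+)\)" success=[^<>]*>[^\w<>]*(\w+\s+\w+)'

print(re.findall(strict, sample))   # []
print(re.findall(lenient, sample))  # [('9999999', 'user nine')]
```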

Upvotes: 0

The fourth bird

Reputation: 163287

You can add an extra capture group, and change the order in which you print the group values.

Note that you can write [a-zA-Z0-9\w]+ as \w+ because that also matches a-zA-Z0-9.

Instead of .+> you can use [^<>]*>. The negated character class cannot cross an angle bracket, which prevents some backtracking.

import re

txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n       
   <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''

for x in re.findall(r"add_recipient\(([0-9]+)\)\" success=[^<>]*>(\w+) (\w+)", txt):
    print(x[1], x[2], x[0])

Output

user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
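
The difference between .+> and [^<>]*> becomes visible when several links end up on one line, for example if the \n sequences in the input are not real newlines. That may be related to the single-result symptom in the question, but this is an assumption, since the failing input is not shown. A small sketch with a made-up two-link sample:

```python
import re

# Hypothetical sample: two links on a single line (no real newline between them).
sample = ('<a onClick="add_recipient(1)" success="x">user one A</a>'
          '<a onClick="add_recipient(2)" success="x">user two B</a>')

greedy = r'add_recipient\(([0-9]+)\)" success=.+>(\w+) (\w+)'
negated = r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+) (\w+)'

# .+ runs to the end of the line and backtracks to the LAST usable ">",
# so the first id is paired with the last name and only one match is found.
print(re.findall(greedy, sample))   # [('1', 'user', 'two')]

# [^<>]* cannot leave the current tag, so each link is matched on its own.
print(re.findall(negated, sample))  # [('1', 'user', 'one'), ('2', 'user', 'two')]
```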

Upvotes: 0
