Reputation: 632
I have the following Python code and I am trying to print each user
together with its number.
This is the regex I tried:
import re
txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n <div class=\"label-field-pair11\">\n <label for=\"student_grade\">Select member</label>\n <div class =\"scrolable\" >\n <div class=\"scroll-inside\">\n <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All <span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" 
before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n </div>\n \n </div>\n </div>\n </div>\n</div>\n\n\n");'''
regexp = re.findall(
r"add_recipient\(([0-9]+)\)\" success=.+>([a-zA-Z0-9\w]+) ", txt)
for x in regexp:
    print(x[1], x[0])
Executing the above code prints the following:
user 0000000
user 1111111
user 2222222
user 3333333
user 4444444
user 5555555
user 6666666
user 7777777
user 8888888
I need the output to be:
user Zero 0000000
user One 1111111
...
How can I get this output? In some cases re.findall
returns only user 8888888,
and I don't know why. How can I get the full match?
Upvotes: 0
Views: 73
Reputation: 260530
Using regex to parse XML/HTML is bad practice. Use a parser (with a bit of regex help) instead:
from bs4 import BeautifulSoup
import re
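soup = BeautifulSoup(txt, 'html.parser')  # name the parser explicitly
out = []
# html.parser lowercases attribute names, so onClick is matched as onclick
for e in soup.find_all('a', onclick=True):
    # grab whatever sits between "add_recipient(" and the last ")"
    m = re.search(r'(?<=add_recipient\().*(?=\))', e['onclick'])
    if m:
        a = m.group()
        out.append((e.contents[0], a))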
output:
[('user zero M ...', '0000000'),
('user One S ...', '1111111'),
('user Two A ...', '2222222'),
('user three H ...', '3333333'),
('user four M ...', '4444444'),
('user Five O ...', '5555555'),
('user six F ...', '6666666'),
('user Seven Mo ...', '7777777'),
('user eight ...', '8888888'),
('ِuser nine M ...', '9999999')]
Alternative output (only the first two words of the name): replace the last line with:
out.append((' '.join(e.contents[0].split(maxsplit=2)[:2]), a))
output:
[('user zero', '0000000'),
('user One', '1111111'),
('user Two', '2222222'),
('user three', '3333333'),
('user four', '4444444'),
('user Five', '5555555'),
('user six', '6666666'),
('user Seven', '7777777'),
('user eight', '8888888'),
('ِuser nine', '9999999')]
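If installing bs4 is not an option, the same idea can be sketched with the standard library's html.parser. This swaps in a different parser from the one used above, and the class name RecipientParser and the shortened sample markup are made up for illustration:

```python
import re
from html.parser import HTMLParser


class RecipientParser(HTMLParser):
    """Collect (name, id) pairs from <a onclick="add_recipient(N)"> links."""

    def __init__(self):
        super().__init__()
        self.pairs = []          # finished (name, id) tuples
        self._pending_id = None  # id of the link whose text is still awaited

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # html.parser lowercases attribute names, so onClick arrives as onclick
        onclick = dict(attrs).get("onclick") or ""
        m = re.fullmatch(r"add_recipient\((\d+)\)", onclick)
        self._pending_id = m.group(1) if m else None

    def handle_data(self, data):
        # The first non-blank text after the opening <a> is the user name
        if self._pending_id is not None and data.strip():
            self.pairs.append((data.strip(), self._pending_id))
            self._pending_id = None


# Hypothetical, shortened markup in the same shape as the question's txt
sample = (
    '<div><a href="#" class="all" onClick="add_all_recipient(\'0000000,1111111\')">'
    'Select All <span> Add </span></a></div>\n'
    '<div><a href="#" onClick="add_recipient(0000000)">user zero M ...<span> Add </span></a></div>\n'
    '<div><a href="#" onClick="add_recipient(1111111)">user One S ...<span> Add </span></a></div>\n'
)

parser = RecipientParser()
parser.feed(sample)
parser.close()
print(parser.pairs)
```

The "Select All" link is skipped because its onclick value does not fully match add_recipient(N), which mirrors the `if m:` check above.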
Upvotes: 2
Reputation: 120409
I'm not a regex expert, but you can try:
out = re.findall(r"add_recipient\(([0-9]+)\)\" success=.+>(\w+\s+\w+)", txt)
print(*[' '.join(i[::-1]) for i in out], sep='\n')
# Output
user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
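One thing to keep in mind with `.+>`: it works here because `.` does not match newlines and every link sits on its own line. The sketch below uses a made-up, shortened one-line sample (the `success="x"` value is a placeholder) to show what can happen when the markup arrives without line breaks, compared with a negated character class:

```python
import re

# Hypothetical sample: two links on ONE line, unlike the question's txt
sample = (
    '<a onClick="add_recipient(0000000)" success="x">user zero M ...<span> Add </span></a> '
    '<a onClick="add_recipient(1111111)" success="x">user One S ...<span> Add </span></a>'
)

# Greedy .+ runs to the end of the line and backtracks to the LAST usable ">"
greedy = re.findall(r'add_recipient\(([0-9]+)\)" success=.+>(\w+\s+\w+)', sample)

# [^<>]* cannot cross an angle bracket, so it stops at the link's own ">"
bounded = re.findall(r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+\s+\w+)', sample)

print(greedy)   # one match, and the id is paired with the wrong name
print(bounded)  # one match per link
```

With the greedy version the first id ends up paired with the last name on the line, so it is safer not to depend on where the line breaks fall.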
Upvotes: 0
Reputation: 163287
You can add an extra capture group, and change the order in which you print the group values.
Note that you can write [a-zA-Z0-9\w]+ as \w+ because \w already matches a-zA-Z0-9.
Instead of .+> you can use [^<>]*> with a negated character class, which prevents some backtracking and keeps the match from crossing the angle brackets.
import re
txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n <div class=\"label-field-pair11\">\n <label for=\"student_grade\">Select member</label>\n <div class =\"scrolable\" >\n <div class=\"scroll-inside\">\n <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All <span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" 
before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n </div>\n \n \n <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n </div>\n \n </div>\n </div>\n </div>\n</div>\n\n\n");'''
for x in re.findall(r"add_recipient\(([0-9]+)\)\" success=[^<>]*>(\w+) (\w+)", txt):
    print(x[1], x[2], x[0])
Output
user zero 0000000
user One 1111111
user Two 2222222
user three 3333333
user four 4444444
user Five 5555555
user six 6666666
user Seven 7777777
user eight 8888888
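The output stops at 8888888 because the last name in txt starts with \u0650 (ARABIC KASRA), a combining mark that \w does not match in Python. A minimal sketch of one way to tolerate such a prefix, assuming it is acceptable to skip any leading non-word characters; the shortened sample and the names strict and lenient are only for illustration:

```python
import re

# Two shortened entries in the same shape as the question's markup; the
# second name starts with U+0650, as the last entry in txt does.
sample = (
    'onClick="add_recipient(8888888)" success="Element.hide(\'loader\')">user eight ...<span> Add </span></a>\n'
    'onClick="add_recipient(9999999)" success="Element.hide(\'loader\')">\u0650user nine M ...<span> Add </span></a>\n'
)

# Pattern from the answer above: the name must start right after ">"
strict = r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+) (\w+)'

# Assumption: skipping leading non-word characters (but never "<" or ">")
# before the first word is acceptable for this data
lenient = r'add_recipient\(([0-9]+)\)" success=[^<>]*>[^\w<>]*(\w+) (\w+)'

for number, first, second in re.findall(lenient, sample):
    print(first, second, number)
```

On this sample the strict pattern finds only the 8888888 entry, while the lenient one also returns user nine 9999999.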
Upvotes: 0