LucSpan

Reputation: 1971

Convert string containing HTML to actual HTML

Set-up

I have several string variables containing HTML; one of them is at https://pastebin.com/rsi3v9nh.

I need to obtain the text inside the HTML. E.g. from the following HTML snippet,

<div class="woocommerce-product-details__short-description">\n<ul>\n<li>50.000 r.p.m.</li>\n<li>Dry technique</li>\n<li>Controllable by foot pedal</li>\n<li>Auto-Cruise</li>\n<li>Twist-lock system</li>\n<li>100W drill power</li>\n<li>7.8 Ncm torque</li>\n<li>220V-240V</li>\n<li>12-months warranty</li>\n</ul>\n</div>\n<p>[/vc_column_text]</p>

I'd like to obtain the text of all <li>s.

Note that this is just one part of the entire string; the text is not only in <li> elements.


Problem

Simply using regex will be quite cumbersome, because the patterns are a bit irregular.

I'm familiar with Selenium to obtain data from HTML, i.e. to do driver.find_element_by_xpath('div') etc. But this works only on HTML objects, not strings.

I was wondering if I can somehow convert the string into HTML and then obtain the texts in a Selenium-like manner.

Any other solution would be ok as well.

Upvotes: 0

Views: 71

Answers (1)

user3483203

Reputation: 51165

You definitely don't want to use regular expressions here.

You can use BeautifulSoup to parse this instead:

from bs4 import BeautifulSoup

s = '<div class="woocommerce-product-details__short-description">\n<ul>\n<li>50.000 r.p.m.</li>\n<li>Dry technique</li>\n<li>Controllable by foot pedal</li>\n<li>Auto-Cruise</li>\n<li>Twist-lock system</li>\n<li>100W drill power</li>\n<li>7.8 Ncm torque</li>\n<li>220V-240V</li>\n<li>12-months warranty</li>\n</ul>\n</div>\n<p>[/vc_column_text]</p>'

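# Name the parser explicitly; without it bs4 picks one itself and emits a warning
soup = BeautifulSoup(s, 'html.parser')
# string=True matches every text node (text=True is the older spelling)
print(soup.find_all(string=True))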

Output:

['\n', '\n', '50.000 r.p.m.', '\n', 'Dry technique', '\n', 'Controllable by foot pedal', '\n', 'Auto-Cruise', '\n', 'Twist-lock system', '\n', '100W drill power', '\n', '7.8 Ncm torque', '\n', '220V-240V', '\n', '12-months warranty', '\n', '\n', '\n', '[/vc_column_text]']
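
If you only want the <li> texts, or want to drop the whitespace-only entries, something along these lines should work. This is a minimal sketch rather than the only way to do it: it assumes the built-in 'html.parser' backend, and the shortened html string and the variable names are just for illustration. select() takes CSS selectors, which is probably the closest thing to the Selenium-style lookups you mention.

```python
from bs4 import BeautifulSoup

# Shortened version of the snippet from the question (illustrative only)
html = ('<div class="woocommerce-product-details__short-description">\n<ul>\n'
        '<li>50.000 r.p.m.</li>\n<li>Dry technique</li>\n'
        '<li>12-months warranty</li>\n</ul>\n</div>\n<p>[/vc_column_text]</p>')

soup = BeautifulSoup(html, 'html.parser')

# Text of every <li>, with surrounding whitespace stripped
li_texts = [li.get_text(strip=True) for li in soup.find_all('li')]

# Same lookup via a CSS selector, restricted to <li>s inside that div
css_texts = [li.get_text(strip=True)
             for li in soup.select('div.woocommerce-product-details__short-description li')]

# All text in the document, skipping whitespace-only strings
all_texts = list(soup.stripped_strings)

print(li_texts)
print(css_texts)
print(all_texts)
```

Since your text is not only in <li> elements, stripped_strings is likely the most useful of the three; the find_all/select variants are there for when you need to target specific elements.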

Upvotes: 2
