FrankMank1

Reputation: 73

How can I extract the text under an HTML div with a given id in Python?

I was wondering how I could extract the text from this tag on this website: https://ru.thefreedictionary.com/%d1%88%d1%87%d0%be

<div id="MainTxt">


            Слово в словаре не найдено.
 <div id="didYouMean"></div>Быть может, вы искали:
<div style="margin:6px 0 3px 0">

The code I'm using gets everything under the div with that id, but I only want the text 'Слово в словаре не найдено.' ("The word was not found in the dictionary.")

soup.findAll("div", attrs = {"id": ["MainTxt"]})

Thank you for any help!

Upvotes: 0

Views: 1176

Answers (2)

joc

Reputation: 199

First of all, there is no need to combine findAll() with an id attribute: an id should be unique within an HTML document, so findAll() will return a list with at most one element. find() returns the element itself, which is more convenient. Here is how you could solve your problem.

match = soup.find('div', {'id': 'MainTxt'})
text = match.text.rstrip().lstrip().split('\n')

rstrip() and lstrip() remove trailing and leading whitespace, respectively (strip() does both in one call). Now text is a list of lines: ['Слово в словаре не найдено.\r', ' Быть может, вы искали:\r', '', ...]. Getting your target string is then easy:

target_string = text[0].replace('\r', '')
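For reference, here is a self-contained sketch of the same idea. The html string is a hypothetical stand-in modelled on the snippet in the question (the real page would first have to be downloaded), and splitlines() is swapped in for split('\n') so that the '\r' characters do not need a separate replace() call.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, modelled on the question's snippet.
html = (
    '<div id="MainTxt">\r\n\r\n'
    '            Слово в словаре не найдено.\r\n'
    ' <div id="didYouMean"></div>Быть может, вы искали:\r\n'
    '<div style="margin:6px 0 3px 0"></div>\r\n'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
match = soup.find("div", {"id": "MainTxt"})

# strip() trims both ends of the text; splitlines() splits on \n, \r\n and \r alike.
lines = match.text.strip().splitlines()

# The first line is the message we want; strip() removes any leftover padding.
target_string = lines[0].strip()
print(target_string)
```

This relies on the wanted message being the first non-empty line inside the div, which holds for the snippet shown but is an assumption about the page in general.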

Upvotes: 1

swarles-barkley

Reputation: 65

I believe the problem you're having is that there is no </div> on the HTML page directly after 'Слово в словаре не найдено.'

That means that "MainTxt" includes everything up to its own matching </div>, nested divs included. You can think of the tags much like parentheses or curly brackets.

So this is similar to:

MainTxt{
Слово в словаре не найдено.
didYouMean{}Быть может, вы искали:

You could try taking all of MainTxt, as in your code, and then removing all nested divs, but unfortunately this may not be as simple as a one-liner, since the HTML you're working with doesn't wrap 'Слово в словаре не найдено.' in its own div.
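One way to sketch that idea with BeautifulSoup is to keep only the text nodes that are direct children of MainTxt and skip the nested divs entirely. The html string below is a hypothetical stand-in based on the snippet in the question, not the real page, so treat this as an illustration under that assumption.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the page, modelled on the question's snippet.
html = (
    '<div id="MainTxt">\r\n\r\n'
    '            Слово в словаре не найдено.\r\n'
    ' <div id="didYouMean"></div>Быть может, вы искали:\r\n'
    '<div style="margin:6px 0 3px 0"></div>\r\n'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="MainTxt")

# Keep text nodes that sit directly inside MainTxt; nested tags are skipped,
# and whitespace-only nodes are dropped.
direct_text = [
    child.strip()
    for child in main.children
    if isinstance(child, NavigableString) and child.strip()
]

# The first direct text node is the message the question asks for.
first_text = direct_text[0] if direct_text else None
print(first_text)
```

Because only direct children are inspected, any text inside the nested divs is ignored, which is the "removing all additional divs" step expressed as a filter rather than as a modification of the tree.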

Upvotes: 1
