Justin M

Reputation: 309

How to get rid of a specific div tag when scraping using Python?

The tag structure looks like this:

<div class="a">
  <div class="b"></div>
  <div class="c"></div>
  Text...
</div>

I would like to get the contents of the class a div, excluding the nested class b and class c divs.

How could I do this?

Upvotes: 0

Views: 130

Answers (2)

Forensic_07

Reputation: 1135

Parsing the XML with lxml's etree and using an XPath expression to select only the text nodes that are direct children of the class a div may be the best solution here, combined with some Python string manipulation as needed. Demonstrating in IPython:

In [1]: from lxml import etree

In [2]: html = '''<div class="a">
   ...:   <div class="b">Unwanted</div>
   ...:   <div class="c">Unwanted</div>
   ...:   Text...
   ...: </div>'''

In [3]: root = etree.fromstring(html)

In [4]: root.xpath("//div[@class='a']/text()")
Out[4]: ['\n  ', '\n  ', '\n  Text...\n']

In [5]: ''.join(root.xpath("//div[@class='a']/text()")).strip()
Out[5]: 'Text...'
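
One caveat: etree.fromstring expects well-formed XML, and scraped pages often are not (an unclosed <br>, for example). The same XPath idea can be sketched as a plain script using lxml.html, which is more forgiving. The helper name direct_text and the sample markup are assumptions made up for illustration, not part of the question:

```python
from lxml import html as lxml_html

# Hypothetical sample: same structure as the question, plus an unclosed <br>
# that a strict XML parser would reject.
page = '''<div class="a">
  <div class="b">Unwanted</div>
  <div class="c">Unwanted</div>
  Text...<br>
</div>'''


def direct_text(markup, cls):
    """Return the text that sits directly inside div.<cls>, skipping child elements."""
    root = lxml_html.fromstring(markup)
    # text() selects only text nodes that are direct children of the div;
    # $cls is an XPath variable supplied as a keyword argument.
    parts = root.xpath("//div[@class=$cls]/text()", cls=cls)
    return ''.join(parts).strip()


result = direct_text(page, "a")
print(result)
```

Note that @class='a' is an exact match on the attribute value, so a div with several classes (class="a x") would need a different predicate.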

Upvotes: 1

Neal Titus Thomas

Reputation: 483

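You can use soup.find_all to select the class a divs (assuming soup is a BeautifulSoup object for the page). Note that each result still contains its class b and class c children: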

filtered_divs = soup.find_all("div", {"class": "a"})
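
To actually exclude the nested divs, one possible approach with BeautifulSoup is sketched below. The sample markup and variable names are assumptions for illustration. It shows two options: reading only the strings that are direct children of the outer div, or removing the unwanted children with decompose() before extracting the text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample matching the structure in the question
html = '''<div class="a">
  <div class="b">Unwanted</div>
  <div class="c">Unwanted</div>
  Text...
</div>'''

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="a")

# Option 1: collect only the strings that are direct children of div.a.
# This leaves the parsed tree untouched.
direct_text = ''.join(outer.find_all(string=True, recursive=False)).strip()

# Option 2: remove the unwanted child divs, then read what is left.
# decompose() mutates the tree, so div.b and div.c are gone afterwards.
for child in outer.find_all("div", class_=["b", "c"]):
    child.decompose()
remaining_text = outer.get_text(strip=True)

print(direct_text)
print(remaining_text)
```

Option 1 is closer to the XPath text() approach; option 2 is handy when you want to keep the cleaned outer div itself rather than just its text.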

Upvotes: 0
