How to get rid of a specific div tag when scraping using Python?

Question

The tag structure looks like this:

<div class="a">
  <div class="b"></div>
  <div class="c"></div>
  Text...
</div>

I would like to get the text of the class a div, excluding the class b and c divs.

How could I do this?

Forensic_07 · Accepted Answer

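Parsing the XML with lxml's etree and using an XPath expression to select just the text nodes you want may be the best solution here, combined with some Python string manipulation as needed. The text() step returns only the text nodes that are direct children of the class a div, so the contents of the nested b and c divs are left out. Demonstrating in IPython: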

In [1]: from lxml import etree

In [2]: markup = '''<div class="a">
   ...:   <div class="b">Unwanted</div>
   ...:   <div class="c">Unwanted</div>
   ...:   Text...
   ...: </div>'''

In [3]: root = etree.fromstring(markup)

In [4]: root.xpath("//div[@class='a']/text()")
Out[4]: ['\n  ', '\n  ', '\n  Text...\n']

In [5]: ''.join(root.xpath("//div[@class='a']/text()")).strip()
Out[5]: 'Text...'
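
If lxml is not available, roughly the same idea can be sketched with the standard library. This is not the lxml approach above but a swapped-in equivalent: as far as I know, xml.etree.ElementTree's limited XPath subset has no text() step, so the sketch reads the element's own .text plus the .tail of each child instead. The helper names direct_text and text_of_div are made up for illustration, and the markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Return text that sits directly inside `element`, skipping child elements."""
    # .text is the text before the first child; each child's .tail is the
    # text that follows that child's closing tag, still inside the parent.
    parts = [element.text or ""]
    for child in element:
        parts.append(child.tail or "")
    return "".join(parts).strip()


def text_of_div(markup, css_class):
    """Hypothetical helper: direct text of the first div with the given class."""
    root = ET.fromstring(markup)
    for div in root.iter("div"):  # iter() also visits the root element itself
        if div.get("class") == css_class:
            return direct_text(div)
    return None


markup = '''<div class="a">
  <div class="b">Unwanted</div>
  <div class="c">Unwanted</div>
  Text...
</div>'''

print(text_of_div(markup, "a"))
```

Real-world pages are often not well-formed XML, in which case a forgiving HTML parser is likely a better fit than either XML parser.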
