Reputation: 7400
I have a poorly designed HTML page from which I am trying to extract data using Scrapy. This is the snippet I am interested in:
<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>
I specifically want to know the number of schools and the number of departments under each school. So I find the list of all schools as follows:
>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']
Then for each school I find the departments under it as follows:
>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').getall())
...     print("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------
This does not work as expected: following-sibling::ul selects every ul that comes after the matched h2, not just those before the next h2. How do I select only the ul elements between one h2 and the next?
Upvotes: 1
Views: 599
Reputation: 1135
One technique is to pick a common divider element that marks the start of each block of info (here, h2), use count() with preceding-sibling to work out each divider's position, and then select the data elements (here, ul) that have exactly that many dividers among their preceding siblings: the dividers before the block plus the block's own divider.
In an IPython shell:
In [1]: from lxml import etree
In [2]: string = '''<html>
...: <h2 class="schoolName">Graduate School of Business</h2>
...: <ul title="Graduate School of Business departments - part 1"></ul>
...: <ul title="Graduate School of Business departments - part 2"></ul>
...: <ul title="Graduate School of Business departments - part 3"></ul>
...: <h2 class="schoolName">School of Law</h2>
...: <ul title="School of Law departments - part 1"></ul>
...: <ul title="School of Law departments - part 2"></ul>
...: <h2 class="schoolName">School of Medicine</h2>
...: <ul title="School of Medicine departments - part 1"></ul>
...: </html>'''
In [3]: root = etree.fromstring(string)
In [4]: schools = root.xpath('//h2[@class="schoolName"]/text()')
In [5]: schools
Out[5]: ['Graduate School of Business', 'School of Law', 'School of Medicine']
In [6]: for school in schools:
...:     print(school)
...:     position = int(root.xpath(f'count(//h2[text()="{school}"]/preceding-sibling::h2) + 1'))
...:     print(f"Position: {position}")
...:     print(root.xpath(f'//ul[count(preceding-sibling::h2) = {position}]/@title'))
...:
Graduate School of Business
Position: 1
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3']
School of Law
Position: 2
['School of Law departments - part 1', 'School of Law departments - part 2']
School of Medicine
Position: 3
['School of Medicine departments - part 1']
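If interpolating the school name into the XPath string is a concern (a name containing a double quote, or two schools with the same name, would break the lookup), a variation is to iterate over the h2 elements themselves and evaluate the count relative to each one. The following is only a sketch of that idea using lxml: the helper name departments_by_school is made up for illustration, and it assumes, like the session above, that every sibling h2 is a school heading and that the markup parses as well-formed XML.

```python
from lxml import etree

# Same markup as in the question.
doc = """<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>"""


def departments_by_school(root):
    """Map each school name to the titles of the <ul> elements in its block.

    Hypothetical helper: assumes every sibling <h2> is a school heading.
    """
    result = {}
    for h2 in root.xpath('//h2[@class="schoolName"]'):
        # 1-based position of this heading among its sibling <h2> elements.
        position = int(h2.xpath('count(preceding-sibling::h2)')) + 1
        # Keep only the later <ul> siblings that have exactly `position`
        # <h2> siblings before them, i.e. those before the next heading.
        # $n is an XPath variable, bound through the keyword argument.
        titles = h2.xpath(
            'following-sibling::ul[count(preceding-sibling::h2) = $n]/@title',
            n=position,
        )
        result[h2.text] = [str(title) for title in titles]
    return result


root = etree.fromstring(doc)
groups = departments_by_school(root)
for school, titles in groups.items():
    print(school, len(titles))
```

Passing the position as an XPath variable rather than through an f-string keeps the expression constant, and len(groups) gives the number of schools directly.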
Upvotes: 2