Reputation: 1101
I need to scrape a web site that has a list which uses a very unfortunate format:
<div class="post">
<b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
<b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
<b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
</div>
I.e., everything is separated by <br/> tags. A new element can be identified by the double <br/> tag at the end of the previous element, and each new element begins with a <b> tag. Making things worse, FIELD3 may also contain <br/> tags. Put differently, FIELD2 is "the field that comes after the closing </b> tag" and FIELD3 is "the field that comes before the double <br/> tag".
This is what I have so far:
Since I couldn't find a good way to grab FIELD2 and FIELD3, I tried wrapping FIELD2 and FIELD3 in a <p> tag by replacing </b><br/> with </b><p> and <br/><br/> with </p><br/>:
def parse(self, response):
    items = response.xpath('//div[@id="mainDiv"]/div[1]')
    items = str.replace(items, "</b><br/>", "</b><p>")
    items = str.replace(items, "<br/><br/>", "</p><br/>")
    for item in items:
        dateX = item.xpath('.//b/text()').extract()
        infoX = item.xpath('.//p/text()').extract()
However, that doesn't work (TypeError: descriptor 'replace' requires a 'str' object but received a 'SelectorList'). That aside, I'm sure there has to be a better solution, but I can't seem to find what it is.
Any help is greatly appreciated!
Upvotes: 0
Views: 2047
Reputation: 1101
This does the trick:
def parse(self, response):
    items = response.xpath('//div[@id="mainDiv"]/div[1]')
    for item in items:
        i = 1
        while i < 10:
            field1 = item.xpath('.//b[' + str(i) + ']/text()').extract()
            field2 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[1]').extract()
            field3 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[2]').extract()
            yield {
                'field1': field1,
                'field2': field2,
                'field3': field3,
            }
            i = i + 1
Only thing left now is to replace i < 10 with the correct total number, but that should be easy enough.
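A minimal sketch of that last step, assuming every record starts with exactly one <b>: counting the <b> children gives the loop bound. To stay runnable without Scrapy it uses the standard library's xml.etree.ElementTree on made-up sample data (the sample values and the helper name count_records are hypothetical, not from the original site). In the spider, len(item.xpath('.//b')) should give the same number.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the markup from the question.
SAMPLE = """<div class="post">
<b>2020-01-01</b><br/>Alice<br/>First note<br/><br/>
<b>2020-01-02</b><br/>Bob<br/>Second note<br/><br/>
</div>"""


def count_records(post):
    """Return the number of records, assuming one <b> child per record."""
    return len(post.findall("b"))


post = ET.fromstring(SAMPLE)
total = count_records(post)

# Use the count as the loop bound instead of a hard-coded 10.
# Positions are 1-based, as in XPath.
dates = []
for i in range(1, total + 1):
    dates.append(post.find(f"b[{i}]").text)

print(total, dates)
```

Looping over the <b> nodes directly would avoid the index altogether; the counted loop is shown only because it is the smallest change to the code above.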
Thanks again, @Tomalak, for pointing me in the right direction!
Upvotes: 0
Reputation: 338128
What about this (or something close to it):
def parse(self, response):
    posts = response.xpath('//div[@id="mainDiv"]/div[@class="post"]')
    for post in posts:
        field1 = post.xpath('./b/text()').extract()
        field2 = post.xpath('./br[1]/following-sibling::text()[1]').extract()
        field3 = post.xpath('./br[2]/following-sibling::text()[1]').extract()
Key point is: Don't use string functions (split, regex, search and replace) on HTML. This is a rule that always applies, but doubly so when you already have a fully parsed DOM tree with XPath support. There is an XPath expression for any node in the tree.
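For the case where FIELD3 itself contains <br/> tags, one way to stay on the parsed tree is to walk the children of the post and group the text that follows each <b>. The sketch below is a stand-in that uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, so it runs on its own. It assumes the fragment is well-formed XML, which real pages often are not, and the sample values and the name parse_post are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first record has a FIELD3 that spans two lines.
SAMPLE = """<div class="post">
<b>2020-01-01</b><br/>Alice<br/>line one<br/>line two<br/><br/>
<b>2020-01-02</b><br/>Bob<br/>only line<br/><br/>
</div>"""


def parse_post(post):
    """Group the text nodes of a post into records, one record per <b>."""
    records = []
    current = None
    for child in post:
        if child.tag == "b":
            # A <b> starts a new record and carries FIELD1.
            current = {"field1": (child.text or "").strip(), "parts": []}
            records.append(current)
        # ElementTree stores the text that follows an element in .tail.
        tail = (child.tail or "").strip()
        if current is not None and tail:
            current["parts"].append(tail)
    # The first text chunk is FIELD2; whatever follows belongs to FIELD3.
    return [
        {
            "field1": r["field1"],
            "field2": r["parts"][0] if r["parts"] else None,
            "field3": r["parts"][1:],
        }
        for r in records
    ]


records = parse_post(ET.fromstring(SAMPLE))
for record in records:
    print(record)
```

FIELD3 is kept as a list of lines so that the embedded <br/> breaks are not lost; joining them with "\n" gives a single string if that is preferred.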
Upvotes: 1