Sebastian

Reputation: 1101

Scraping a list separated by <br>-tags

I need to scrape a web site that has a list which uses a very unfortunate format:

<div class="post">
    <b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
    <b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
    <b>FIELD1</b><br/>FIELD2<br/>FIELD3<br/><br/>
</div>

I.e., everything is separated by <br/> tags. The end of an element is marked by a double <br/> tag, and the next element begins with a <b> tag. Making things worse, FIELD3 may itself contain <br/> tags. Put differently, FIELD2 is "the field that comes after the closing </b> tag" and FIELD3 is "the field that comes before the double <br/> tag".

This is what I have so far:

Since I couldn't find a good way to grab FIELD2 and FIELD3, I tried creating a <p> tag around FIELD2 and FIELD3 by replacing </b><br/> with </b><p> and <br/><br/> with </p><br/>:

def parse(self, response):
    items = response.xpath('//div[@id="mainDiv"]/div[1]')
    items = str.replace(items, "</b><br/>", "</b><p>")
    items = str.replace(items, "<br/><br/>", "</p><br/>")

    for item in items :
        dateX = item.xpath('.//b/text()').extract()
        infoX = item.xpath('.//p/text()').extract()

However, that doesn't work (TypeError: descriptor 'replace' requires a 'str' object but received a 'SelectorList'). That aside, I'm sure there has to be a better solution, but I can't seem to find what it is.

Any help is greatly appreciated!

Upvotes: 0

Views: 2047

Answers (2)

Sebastian

Reputation: 1101

This does the trick:

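def parse(self, response):
    items = response.xpath('//div[@id="mainDiv"]/div[1]')
    for item in items:
        # XPath positions are 1-based, so start at 1.
        # The upper bound of 10 is a placeholder for the real record count.
        i = 1
        while i < 10:
            # The i-th <b> holds FIELD1; the next two text nodes hold FIELD2 and FIELD3.
            field1 = item.xpath('.//b[' + str(i) + ']/text()').extract()
            field2 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[1]').extract()
            field3 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[2]').extract()

            yield {
                'field1': field1,
                'field2': field2,
                'field3': field3,
            }

            i = i + 1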

The only thing left now is to replace i < 10 with the actual number of records, but that should be easy enough.

Thanks again, @Tomalak, for pointing me in the right direction!

Upvotes: 0

Tomalak

Reputation: 338128

What about this (or something close to it):

def parse(self, response):
    posts = response.xpath('//div[@id="mainDiv"]/div[@class="post"]')
    for post in posts:
        field1 = post.xpath('./b/text()').extract()
        field2 = post.xpath('./br[1]/following-sibling::text()[1]').extract()
        field3 = post.xpath('./br[2]/following-sibling::text()[1]').extract()

The key point is: don't use string functions (split, regex, search and replace) on HTML. This rule always applies, but doubly so when you already have a fully parsed DOM tree with XPath support. There is an XPath expression for any node in the tree.

Upvotes: 1
