Scraping a list separated by <br>-tags

Question

I need to scrape a web site that has a list which uses a very unfortunate format:


    <b>FIELD1</b><br>
    FIELD2<br>
    FIELD3<br>
    <br>
    <b>FIELD1</b><br>
    FIELD2<br>
    FIELD3<br>
    <br>
    <b>FIELD1</b><br>
    FIELD2<br>
    FIELD3<br>
    <br>

I.e., everything is separated by <br>-tags. A new element can be identified by a double <br>-tag at the end of the previous element, and each new element begins with a <b> tag. Making things worse, FIELD3 may also contain <br> tags. Put differently, FIELD2 is "the field that comes after the closing </b> tag" and FIELD3 is "the field that comes before the double <br> tag".

This is what I have so far:

Since I couldn't find a good way to grab FIELD2 and FIELD3, I tried wrapping FIELD2 and FIELD3 in a <p> tag by replacing the tags around them in the markup:

def parse(self, response):
    items = response.xpath('//div[@id="mainDiv"]/div[1]')
    items = str.replace(items, "
", "")
    items = str.replace(items, "

", "

")

    for item in items :
        dateX = item.xpath('.//b/text()').extract()
        infoX = item.xpath('.//p/text()').extract()

However, that doesn't work (TypeError: descriptor 'replace' requires a 'str' object but received a 'SelectorList'). That aside, I'm sure there has to be a better solution, but I can't seem to find what it is.

Any help is greatly appreciated!

Sebastian · Accepted Answer

This does the trick:

    def parse(self, response):
        items = response.xpath('//div[@id="mainDiv"]/div[1]')
        for item in items:
            i = 1
            while i < 10:
                # XPath positions are 1-based, so b[i] is the i-th <b> element.
                field1 = item.xpath('.//b[' + str(i) + ']/text()').extract()
                # First and second text nodes that follow the i-th <b>.
                field2 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[1]').extract()
                field3 = item.xpath('.//b[' + str(i) + ']/following-sibling::text()[2]').extract()

                yield {
                    'field1': field1,
                    'field2': field2,
                    'field3': field3,
                }

                i = i + 1

The only thing left now is to replace i < 10 with the actual number of entries, but that should be easy enough.

Thanks again, @Tomalak, for pointing me in the right direction!
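The hard-coded i < 10 and the case where FIELD3 itself contains <br> tags can both be avoided by letting each <b> tag open a new record and collecting every text node up to the next <b>. The following is only a minimal sketch of that idea, and it swaps Scrapy's XPath selectors for Python's standard-library html.parser so that it runs on its own. The names BrListParser and parse_records and the sample markup are hypothetical, and joining the FIELD3 pieces with a space is an assumption about the desired output.

```python
from html.parser import HTMLParser


class BrListParser(HTMLParser):
    """Collect records shaped like <b>FIELD1</b><br>FIELD2<br>FIELD3<br><br>."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict per <b> tag seen so far
        self._in_bold = False  # True while inside a <b>...</b> pair

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            # A <b> tag marks the start of a new record.
            self.records.append({"field1": "", "rest": []})
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.records:
            return  # skip whitespace and anything before the first <b>
        if self._in_bold:
            self.records[-1]["field1"] += text
        else:
            self.records[-1]["rest"].append(text)


def parse_records(html):
    """Return a list of dicts with field1, field2 and field3 for each record."""
    parser = BrListParser()
    parser.feed(html)
    parser.close()
    result = []
    for rec in parser.records:
        rest = rec["rest"]
        result.append({
            "field1": rec["field1"],
            "field2": rest[0] if rest else "",
            # Everything after FIELD2 is treated as FIELD3, even across <br> tags.
            "field3": " ".join(rest[1:]),
        })
    return result


# Hypothetical sample markup; the second record has a FIELD3 split by a <br>.
sample = (
    "<b>A1</b><br>A2<br>A3<br><br>"
    "<b>B1</b><br>B2<br>B3 line one<br>B3 line two<br><br>"
)
records = parse_records(sample)
for record in records:
    print(record)
```

In a Scrapy spider, the same grouping could presumably be applied to the markup of the container element, for example the string returned by the selector's extract(), instead of to the SelectorList itself.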
