hi_there_hello

Reputation: 11

Scrapy - python - web crawler. Output list of all xpaths instead of only first match

revenues_in = MapCompose(MatchEndDate(float))
revenues_out = Compose(imd_filter_member, imd_mult, imd_max)

def add_xpath(self, field_name, xpath, *processors, **kw):
    values = self._get_values(xpath, **kw)
    self.add_value(field_name, values, *processors, **kw)
    return len(self._values[field_name])

def add_xpaths(self, name, paths):
    for path in paths:
        match_count = self.add_xpath(name, path)
        if match_count > 0:
            return match_count
    return 0

self.add_xpaths('revenues', [
    '//us-gaap:Revenues',
    '//us-gaap:SalesRevenueNet',
    '//us-gaap:SalesRevenueGoodsNet',
    '//us-gaap:SalesRevenueServicesNet',
    '//us-gaap:RealEstateRevenueNet',
    '//*[local-name()="NetRevenuesIncludingNetInterestIncome"]',
    '//*[contains(local-name(), "TotalRevenues") and contains(local-name(), "After")]',
    '//*[contains(local-name(), "TotalRevenues")]',
    '//*[local-name()="InterestAndDividendIncomeOperating" or local-name()="NoninterestIncome"]',
    '//*[contains(local-name(), "Revenue")]'
])

Currently, the code only spits out the first match in the list of xpaths. I'd like it to return the maximum value out of all xpaths that matched. Please advise.

This is of course a subsection of the code that I thought was relevant. If you'd like to see any additional code, please visit https://github.com/eliangcs/pystock-crawler/tree/master/pystock_crawler

Thank you for your time and help!

Upvotes: 1

Views: 101

Answers (1)

Tay

Reputation: 298

This isn't working because add_xpaths returns from inside the loop as soon as one XPath produces a match (match_count > 0). The return exits the function, so every path after the first matching one is never tried. Instead, let the loop run over the entire list, keep the count in a variable, and return it only after the loop has finished.

Instead of this:

def add_xpaths(self, name, paths):
    for path in paths:
        match_count = self.add_xpath(name, path)
        if match_count > 0:
            return match_count
    return 0

Try this:

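def add_xpaths(self, name, paths):
    match_count = 0
    for path in paths:
        # add_xpath (as defined in the question) returns the running total
        # of values collected for the field, so assign it rather than add it.
        match_count = self.add_xpath(name, path)
    return match_count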
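
To see the difference without running the crawler, here is a minimal sketch. FakeLoader, its dictionary-backed document, and the sample numbers are all made up for illustration; it is not Scrapy's real loader class. Its add_xpath copies the one in the question in returning the running total of values for the field, which is why the looping version assigns the count instead of adding to it.

```python
from collections import defaultdict


class FakeLoader:
    """Hypothetical stand-in for the item loader; not Scrapy's real class."""

    def __init__(self, document):
        # document maps an XPath string to the list of values it would match.
        self._document = document
        self._values = defaultdict(list)

    def add_xpath(self, field_name, xpath):
        # Like the question's override: returns the running total for the field.
        self._values[field_name].extend(self._document.get(xpath, []))
        return len(self._values[field_name])

    def add_xpaths_first_match(self, name, paths):
        # Original behaviour: stop at the first path that yields a value.
        for path in paths:
            match_count = self.add_xpath(name, path)
            if match_count > 0:
                return match_count
        return 0

    def add_xpaths_all(self, name, paths):
        # Revised behaviour: try every path, then report the total.
        match_count = 0
        for path in paths:
            match_count = self.add_xpath(name, path)
        return match_count


# Made-up sample data: two paths match, with different values.
document = {
    '//us-gaap:Revenues': [100.0],
    '//us-gaap:SalesRevenueNet': [250.0, 90.0],
}
paths = ['//us-gaap:Revenues', '//us-gaap:SalesRevenueNet']

first = FakeLoader(document)
first_count = first.add_xpaths_first_match('revenues', paths)

every = FakeLoader(document)
every_count = every.add_xpaths_all('revenues', paths)
largest = max(every._values['revenues'])
```

With the early return, only the value from the first matching path is collected. With the full loop, all three values are collected, so a later step can pick the maximum from them. In the real crawler that selection would be up to the revenues_out processors rather than a plain max().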

Upvotes: 1
