How to extract text from a variable using Scrapy?

Question

I am scraping a business directory using Scrapy and am running into an issue with trying to extract data using variables. Here is the code:

    def parse_page(self, response):
        url = response.meta.get('URL')

        # Parse the locations area of the page
        locations = response.css('address::text').extract()
        # Takes the City and Province and removes unicode and removes whitespace,
        # they are still together though.
        city_province = locations[1].replace(u'\xa0', u' ').strip()
        # List of all social links that the business has
        social = response.css('.entry-content > div:nth-child(2) a::attr(href)').extract()

        add_info = response.css('ul.list-border li').extract()
        year = ""

        for info in add_info:
            if 'Year' in info:
                year = info
            else:
                pass

        yield {
            'title': response.css('h1.entry-title::text').extract_first().strip(),
            'description': response.css('p.mb-double::text').extract_first(),
            'phone_number': response.css('div.mb-double ul li::text').extract_first(default="").strip(),
            'email': response.css('div.mb-double ul li a::text').extract_first(default=""),
            'address': locations[0].strip(),
            'city': city_province.split(' ', 1)[0].replace(',', ''),
            'province': city_province.split(' ', 1)[1].replace(',', '').strip(),
            'zip_code': locations[2].strip(),
            'website': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(1) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'facebook': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(2) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'twitter': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(3) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'linkedin': response.css('.entry-content > div:nth-child(2) > ul:nth-child(2) > li:nth-child(4) > a:nth-child(1)::attr(href)').extract_first(default=''),
            'year': year,
            'employees': response.css('.list-border > li:nth-child(2)::text').extract_first(default="").strip(),
            'key_contact': response.css('.list-border > li:nth-child(3)::text').extract_first(default="").strip(),
            'naics': response.css('.list-border > li:nth-child(4)::text').extract_first(default="").strip(),
            'tags': response.css('ul.biz-tags li a::text').extract(),
        }

The problem I am having is from here:

        add_info = response.css('ul.list-border li').extract()
        year = ""

        for info in add_info:
            if 'Year' in info:
                year = info
            else:
                pass

The code checks whether the information is "Year Established". However, it returns HTML, and I want it to print out just the year. Using add_info = response.css('ul.list-border li::text').extract() will print out the year, but how can I do this in the for loop?

Whenever "Year" is in info, it outputs like this:

Year Established: 1998

I am looking to get just the year and not the HTML.

James · Accepted Answer

Add the following function.

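    def getYear(yearnum):
        # Skip the first 35 characters (the markup and label before the year);
        # this offset is specific to this page's HTML.
        yearnum1 = str(yearnum[35:])
        # Keep the next four characters, i.e. the year itself.
        yearnum2 = str(yearnum1[:4])
        return yearnum2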

Then replace your for statement with the following.

    for info in add_info:
        if 'Year' in info:
            yearanswer = getYear(info)
        else:
            pass

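This takes the four-digit number out of your long string and puts it in the string yearanswer. If you print yearanswer, it should print 1998. It did for me! To have it appear in the scraped item, assign the result to year instead, since that is the variable the yield uses.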
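
The fixed offset of 35 only works while the markup before the year keeps exactly the same length. If that is a concern, a regular expression might be a more forgiving alternative. Below is a minimal sketch using only Python's re module. The name get_year and the sample HTML string are made up for illustration, so the real markup on the page may differ.

```python
import re


def get_year(info):
    """Return the first four-digit number found in an HTML fragment, or ''."""
    # Drop anything that looks like a tag, e.g. <li> or </span>.
    text = re.sub(r'<[^>]+>', '', info)
    # Look for a standalone four-digit number in the remaining text.
    match = re.search(r'\b\d{4}\b', text)
    return match.group(0) if match else ''


# Hypothetical sample; the actual list item on the page may look different.
sample = '<li><span>Year Established:</span> 1998</li>'
print(get_year(sample))  # 1998
```

In the loop you would call year = get_year(info) in place of getYear(info). Since it returns an empty string when no year is found, it keeps the same default as year = "" in the question. Scrapy's selectors also have a .re_first() method that could probably do the same directly on response.css('ul.list-border li::text'), though I have not tried it against this page.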
