Scraper fails to print all results

Question

I've written a Python script to scrape the "name" and "phone" of five items from craigslist. The problem is that when I run the script, it prints only three results instead of five. To be more specific, the first two links have no additional (contact info) link on their page, so no second request is needed for them. However, those two links never get past the "if ano_page_link:" statement in my second function, so they are never printed. How can I fix this so that the scraper prints all five results, whether or not a phone number is available?

The script I'm trying with:

import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in"

url_list = [
'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html',
'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html',
'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html',
'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html',
'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html'
]

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):

    if ano_page_link:
        page = requests.get(ano_page_link).text            
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)

Results I'm having:

Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

Results I'm expecting:

A Flat is for sale at  Cooke Town
Prestige Sunnyside
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

Andersson · Accepted Answer

Note that on, for example, http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html no link matches the 'a.showcontact' selector, so the following block

try:
    link = base + tree.cssselect('a.showcontact')[0].attrib['href']
except IndexError:
    link = ""

sets link = "" (cssselect returns an empty list, so indexing it with [0] raises IndexError).

Then, when if ano_page_link: is evaluated, the whole if block is skipped, because an empty string is falsy (bool("") is False), and nothing is printed.
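The truthiness rule can be checked in isolation, without any network request. This is only a small illustration; the variable `branch_taken` is made up for the demo and is not part of the original script.

```python
# link is what the "except IndexError" branch assigns when no contact link exists
link = ""

# An empty string is falsy, so the "if" body is skipped
if link:
    branch_taken = "if"
else:
    branch_taken = "else"

print(bool(link), branch_taken)
```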

You can try the following instead:

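def parse_doc(title, ano_page_link):
    if ano_page_link:
        # A contact page exists: fetch it and take the first 10-digit number, if any.
        page = requests.get(ano_page_link).text
        match = re.search(r'\d{10}', page)
        tel = match.group() if match else ""
        print(title, tel)
    else:
        # No contact link on the listing: print the title on its own.
        print(title)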
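If you want to check the logic without hitting craigslist, one option is to separate the text handling from the request. The sketch below is only an illustration under that assumption: `extract_phone` and `format_result` are hypothetical helper names, not part of the original script, and the sample page text is made up.

```python
import re


def extract_phone(page_text):
    """Return the first run of 10 digits in page_text, or "" if there is none."""
    match = re.search(r'\d{10}', page_text)
    return match.group() if match else ""


def format_result(title, contact_page_text=None):
    """Build the output line for one listing.

    contact_page_text is the text of the contact page, or None when the
    listing has no contact link (hypothetical convention for this sketch).
    """
    tel = extract_phone(contact_page_text) if contact_page_text else ""
    # strip() drops the trailing space when no phone number was found
    return (title + " " + tel).strip()


# Listing without a contact link: the title is still produced
print(format_result("Prestige Sunnyside"))
# Listing with a contact page (made-up sample text)
print(format_result("Jayanagar 2nd Block", "call 9845012673 now"))
```

With this split, parse_doc would only need to fetch the page (when a link exists) and print whatever format_result returns, so every title is printed either way.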
