SIM

Reputation: 22440

Scraper fails to print all results

I've written a Python script to scrape the "name" and "phone" of five listings from craigslist. When I run it, I get only three results instead of five. The first two links have no additional (contact info) link on their page, so no second request is needed for them. However, because those two listings have no contact link, they never get past the "if ano_page_link:" check in my second function and are never printed. How can I fix this so that the scraper prints all five results, whether or not a listing has a phone number?

The script I'm trying with:

import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in"

url_list = [
'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html',
'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html',
'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html',
'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html',
'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html'
]

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):

    if ano_page_link:
        page = requests.get(ano_page_link).text            
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)

Results I'm having:

Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

Results I'm expecting:

A Flat is for sale at  Cooke Town
Prestige Sunnyside
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364

Upvotes: 1

Views: 122

Answers (2)

CtheSky

Reputation: 2624

You gain flexibility by separating the two tasks: collecting the data and printing it. That also makes it easier to add more fields later if you want to extend the scraper.

def collect_info(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)

    title = get_title(tree)
    contact_link = get_contact_link(tree)
    tel = get_tel(contact_link) if contact_link else ''

    return title, tel


def get_title(tree):
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    return name

def get_contact_link(tree):
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    return link

def get_tel(ano_page_link):
    page = requests.get(ano_page_link).text
    # Take the first run of ten digits, or "" if the page has none.
    match = re.search(r'\d{10}', page)
    return match.group(0) if match else ""

def print_info(title, tel):
    if tel:
        fmt = 'Title: {title}, Phone: {tel}'
    else:
        fmt = 'Title: {title}'
    print(fmt.format(title=title, tel=tel))

if __name__ == '__main__':
    for link in url_list:
        title, tel = collect_info(link)
        print_info(title, tel)
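
One benefit of this split is that the formatting step can be checked without any network access. The sketch below is only an illustration: `format_info` is a hypothetical variant of `print_info` that returns the line instead of printing it, and the sample pairs stand in for whatever `collect_info` would return.

```python
def format_info(title, tel):
    """Build the output line; the phone part is added only when a number was found."""
    if tel:
        return 'Title: {title}, Phone: {tel}'.format(title=title, tel=tel)
    return 'Title: {title}'.format(title=title)


# Sample (title, tel) pairs standing in for the values collect_info would return.
samples = [
    ("Prestige Sunnyside", ""),
    ("Jayanagar 2nd Block, 4000 sft Plot for Sale", "9845012673"),
]

for title, tel in samples:
    print(format_info(title, tel))
```

Both listings produce a line here, with or without a phone number.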

Upvotes: 1

Andersson

Reputation: 52665

Note that, for example, on http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html no link matches the 'a.showcontact' selector, so the following block

try:
    link = base + tree.cssselect('a.showcontact')[0].attrib['href']
except IndexError:
    link = ""

sets link = ""

Then, when parse_doc reaches if ano_page_link:, everything inside the if block is skipped, because an empty string evaluates to False, and nothing is printed.
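
The effect can be reproduced without any request. In this minimal sketch, `describe` is a hypothetical stand-in for the original parse_doc that returns what would be printed rather than fetching a page:

```python
def describe(title, link):
    """Return the line the original parse_doc would print, or None if it prints nothing."""
    if link:  # "" is falsy, so this branch is skipped when there is no contact link
        return "{} <phone fetched from {}>".format(title, link)
    return None  # the original function falls through silently here


print(bool(""))                            # False
print(describe("Prestige Sunnyside", ""))  # None
```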

You can try the following instead:

def parse_doc(title, ano_page_link):

    if ano_page_link:
        page = requests.get(ano_page_link).text            
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)
    else:
        print(title)
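
As a side note, the phone lookup runs the same regex twice. A possible tidy-up, shown here as a sketch with a hypothetical helper name `first_phone` and made-up sample text, is to search once and reuse the match:

```python
import re


def first_phone(page_text):
    """Return the first run of ten digits in page_text, or "" if there is none."""
    match = re.search(r'\d{10}', page_text)
    return match.group(0) if match else ""


print(first_phone("Call 9845012673 after 6 pm"))  # 9845012673
print(repr(first_phone("no number here")))        # ''
```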

Upvotes: 1
