Reputation: 22440
I've written a Python script to scrape the "name" and "phone" of five items from craigslist. The problem is that when I run it, I get only three results instead of five. To be more specific, the first two links have no additional (contact info) link on their page, so no second request is needed for them. However, those two links never get past the "if ano_page_link:" statement in my second function, so they are never printed. How can I fix this so that the scraper prints all five results, whether or not a phone number is available?
The script I'm trying with:
import re
import requests
from lxml import html

base = "http://bangalore.craigslist.co.in"

url_list = [
    'http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-sunnyside/6259128505.html',
    'http://bangalore.craigslist.co.in/reb/d/jayanagar-2nd-block-4000-sft/6221720477.html',
    'http://bangalore.craigslist.co.in/reb/d/prestige-ozone-type-3-r-villa/6259928614.html',
    'http://bangalore.craigslist.co.in/reb/d/zed-homes-3-bedroom-flat-for/6257075793.html'
]

def get_link(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    parse_doc(name, link)

def parse_doc(title, ano_page_link):
    if ano_page_link:
        page = requests.get(ano_page_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)

if __name__ == '__main__':
    for link in url_list:
        get_link(link)
Results I'm having:
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
Results I'm expecting:
A Flat is for sale at Cooke Town
Prestige Sunnyside
Jayanagar 2nd Block, 4000 sft Plot for Sale 9845012673
PRESTIGE OZONE TYPE D 3 B/R VILLA FOR SALE 9611226364
T ZED HOMES 3 BEDROOM FLAT FOR SALE 9611226364
Upvotes: 1
Views: 122
Reputation: 2624
You can gain more flexibility by separating the two tasks: collecting the data and printing it. This makes it easier to add more information later when you want to extend the script.
def collect_info(medium_link):
    response = requests.get(medium_link).text
    tree = html.fromstring(response)
    title = get_title(tree)
    contact_link = get_contact_link(tree)
    tel = get_tel(contact_link) if contact_link else ''
    return title, tel

def get_title(tree):
    try:
        name = tree.cssselect('span#titletextonly')[0].text
    except IndexError:
        name = ""
    return name

def get_contact_link(tree):
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
    return link

def get_tel(ano_page_link):
    page = requests.get(ano_page_link).text
    tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
    return tel

def print_info(title, tel):
    if tel:
        fmt = 'Title: {title}, Phone: {tel}'
    else:
        fmt = 'Title: {title}'
    print(fmt.format(title=title, tel=tel))

if __name__ == '__main__':
    for link in url_list:
        title, tel = collect_info(link)
        print_info(title, tel)
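One way to see the benefit of the split: once the collecting step returns plain values, the output step can be swapped without touching the scraping code. A rough sketch, assuming the (title, tel) pairs have already been collected (the sample pairs are taken from the expected output in the question). The `to_csv` helper is hypothetical and not part of the code above.

```python
import csv
import io


def to_csv(rows):
    """Serialise (title, tel) pairs as CSV text instead of printing them."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "phone"])
    writer.writerows(rows)
    return buffer.getvalue()


# Pairs shaped like the return value of collect_info;
# tel is '' when the page has no contact link.
rows = [
    ("Prestige Sunnyside", ""),
    ("Jayanagar 2nd Block, 4000 sft Plot for Sale", "9845012673"),
]
print(to_csv(rows))
```

Titles containing a comma are quoted by the csv module, so the phone column stays aligned.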
Upvotes: 1
Reputation: 52665
Note that on, for example, http://bangalore.craigslist.co.in/reb/d/flat-is-for-sale-at-cooke-town/6266183606.html there is no link matched by the 'a.showcontact' selector, so the following block
    try:
        link = base + tree.cssselect('a.showcontact')[0].attrib['href']
    except IndexError:
        link = ""
sets link = "". Then, when parse_doc reaches if ano_page_link:, every statement inside the if block is skipped, because an empty string evaluates to False, and so nothing is printed.
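This behaviour can be reproduced without any network access. A minimal sketch, where `describe` is a hypothetical helper (not part of the original script) that only mirrors the control flow of parse_doc, and the example.com URL is a placeholder:

```python
def describe(title, ano_page_link):
    """Mimic the control flow of parse_doc without making any request."""
    printed = []
    if ano_page_link:  # "" is falsy, so this branch is skipped for empty links
        printed.append(title)
    return printed


# A listing with a contact link enters the branch...
print(describe("Jayanagar 2nd Block", "http://example.com/contact"))
# ...while one with link == "" produces nothing at all.
print(describe("Prestige Sunnyside", ""))
print(bool(""))
```

The second call returns an empty list, which corresponds to the two listings missing from the output.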
You can try the version below instead:
def parse_doc(title, ano_page_link):
    if ano_page_link:
        page = requests.get(ano_page_link).text
        tel = re.findall(r'\d{10}', page)[0] if re.findall(r'\d{10}', page) else ""
        print(title, tel)
    else:
        print(title)
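A possible variant gives tel a default value before the check, so a single output path covers both cases. This is only a sketch under assumptions: `format_listing`, its `fetch` parameter and `fake_fetch` are hypothetical names introduced here so the logic can be tried offline. In the real script `fetch` would be something like `lambda u: requests.get(u).text`.

```python
import re


def format_listing(title, ano_page_link, fetch):
    """Return the line to print for one listing.

    `fetch` is a hypothetical callable that takes a URL and returns page text.
    """
    tel = ""  # default when there is no contact page or no number on it
    if ano_page_link:
        match = re.search(r'\d{10}', fetch(ano_page_link))
        if match:
            tel = match.group()
    # strip() drops the trailing space left when tel is empty
    return "{} {}".format(title, tel).strip()


def fake_fetch(url):
    """Stand-in for the network call, used only for this illustration."""
    return "call 9845012673 now"


print(format_listing("Jayanagar 2nd Block", "http://example.com/c", fake_fetch))
print(format_listing("Prestige Sunnyside", "", fake_fetch))
```

Returning the string instead of printing inside the function also makes the behaviour easy to check for listings with and without a contact link.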
Upvotes: 1