Venkateshwaran Selvaraj

Reputation: 1785

Scraping from multiple URLs using BeautifulSoup

I have my code working. Now I would like to modify it slightly to get the data from multiple URLs. The URLs differ by only one word.

Here is my code; it currently fetches from only one URL.

from string import punctuation, whitespace
import urllib2
import datetime
import re
from bs4 import BeautifulSoup as Soup
import csv
today = datetime.date.today()
html = urllib2.urlopen("http://www.99acres.com/property-in-velachery-chennai-south-ffid").read()

soup = Soup(html)
print "INSERT INTO `property` (`date`,`Url`,`Rooms`,`place`,`PId`,`Phonenumber1`,`Phonenumber2`,`Phonenumber3`,`Typeofperson`,` Nameofperson`,`typeofproperty`,`Sq.Ft`,`PerSq.Ft`,`AdDate`,`AdYear`)"
print 'VALUES'
re_digit = re.compile('(\d+)')
properties = soup.findAll('a', title=re.compile('Bedroom'))

for eachproperty in soup.findAll('div', {'class':'sT'}):
  a      = eachproperty.find('a', title=re.compile('Bedroom'))
  pdate  = eachproperty.find('i', {'class':'pdate'})
  pdates = re.sub('(\s{2,})', ' ', pdate.text)
  div    = eachproperty.find('div', {'class': 'sT_disc grey'})
  try:
    project = div.find('span').find('b').text.strip()
  except:
    project = 'NULL'        
  area = re.findall(re_digit, div.find('i', {'class': 'blk'}).text.strip())
  print ' ('
  print today,","+ (a['href'] if a else '`NULL`')+",", (a.string if a else 'NULL, NULL')+ "," +",".join(re.findall("'([a-zA-Z0-9,\s]*)'", (a['onclick'] if a else 'NULL, NULL, NULL, NULL, NULL, NULL')))+","+ ", ".join([project] + area),","+pdates+""
  print ' ), '

Here are the URLs that I would like to fetch from in the same run:

http://www.99acres.com/property-in-velachery-chennai-south-ffid
http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid
http://www.99acres.com/property-in-madipakkam-chennai-south-ffid

So you can see that there is only one word that differs in every URL.

I am trying to loop over a list of them, like the following:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid
, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"
html = urllib2.urlopen(link)
soup = Soup(html)

This does not seem to work, and what I would actually like is to pass just that ONE WORD into the URL, like this:

for locality in areas(madipakkam, thoraipakkam, velachery):
    link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"
    html= urllib2.urlopen(link)
    soup = BeautifulSoup(html)

I hope I have made this clear.

Upvotes: 1

Views: 404

Answers (1)

abarnert

Reputation: 365627

This:

for locality in areas (http://www.99acres.com/property-in-velachery-chennai-south-ffid, http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid,    http://www.99acres.com/property-in-madipakkam-chennai-south-ffid):
link = "str(locality)"

… is not going to work, for multiple reasons.

First, you're calling an areas function you never defined anywhere. And I'm not sure what you expected that function to do anyway.

Second, you're trying to pass http://www.99acres.com/property-in-velachery-chennai-south-ffid as if it were a meaningful Python expression, when it's not even parseable. If you want to pass a string, you have to put it in quotes.

Third, "str(locality)" is the literal string str(locality). If you want to call the str function on the locality variable, don't put quotes around it. But really, there's no reason to call str at all; locality is already a string.

Finally, you didn't indent the body of the for loop. You have to indent that link = line, and all the stuff you were previously doing at the top level, so that it falls under the for. That way, it happens once for each value in the loop, instead of just once in total after the loop is done.
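The last two points are easy to check in isolation. Here is a minimal sketch that fetches nothing; the names quoted, called, visited, and after_loop are made up purely for illustration and are not part of your code:

```python
urls = (
    "http://www.99acres.com/property-in-velachery-chennai-south-ffid",
    "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
    "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid",
)

locality = "velachery"
quoted = "str(locality)"  # a string literal: nothing is called here
called = str(locality)    # a real call: gives back the value of locality

visited = []  # hypothetical list, only here to record what the loop saw
for link in urls:
    visited.append(link)  # indented: part of the loop body, runs once per URL
after_loop = link         # not indented: runs once, after the loop has finished
```

After this runs, visited holds all three URLs, while after_loop holds only the last one. That is what happens to your scraping code if it stays at the top level instead of being indented under the for.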

Try this:

for link in ("http://www.99acres.com/property-in-velachery-chennai-south-ffid",
             "http://www.99acres.com/property-in-thoraipakkam-chennai-south-ffid",
             "http://www.99acres.com/property-in-madipakkam-chennai-south-ffid"):
    # all the stuff you do for each URL

You were on the right track with this:

for locality in areas(madipakkam, thoraipakkam, velachery):
link = “http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

Using a "template string" to avoid repeating yourself is almost always a good idea.

But again, there are a number of problems.

First, you've again called an areas function that doesn't exist, and tried to use bare strings without quotes around them.

Second, you've got the opposite problem to the last one: you tried to put expressions that you want to evaluate, + and str(locality), into the middle of a string. You need to break this up into two separate strings that can be part of the + expression.

And again, you haven't indented the loop body, and you're calling str unnecessarily.
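To see the second problem on its own, compare the two assignments below. This is only a sketch; wrong and right are names I made up for the comparison, and the straight quotes are deliberate:

```python
locality = "madipakkam"

# Everything between the quotes is literal text, including the + signs
# and the characters "str(locality)". Nothing here is evaluated.
wrong = "http://www.99acres.com/property-in-+ str(locality)+-chennai-south-ffid"

# Two separate string literals, joined to the variable by real + operators.
right = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"
```

Printing both makes the difference obvious: wrong still contains the text str(locality), while right contains madipakkam.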

So:

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-" + locality + "-chennai-south-ffid"
    # all the stuff you do for each URL

While we're at it, it's usually a lot easier to read your code, and easier to make sure you haven't gotten something wrong, when you use formatting functions instead of trying to concatenate strings together. For example:

for locality in "velachery", "thoraipakkam", "madipakkam":
    link = "http://www.99acres.com/property-in-{}-chennai-south-ffid".format(locality)
    # all the stuff you do for each URL

Here, it's immediately obvious where each locality will fit into the string, and what the string will look like, and where the hyphens are, and so on.
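If you want to go one step further, you could pull the URL building and the per-URL work into functions. The following is a rough sketch under a few assumptions, not the one right way to do it: build_links, scrape_all, fake_fetch, URL_TEMPLATE, and LOCALITIES are all names I invented, and fetch and handle are placeholders for your download step and for "all the stuff you do for each URL". A stand-in fetch is used so that the example runs without touching the network:

```python
URL_TEMPLATE = "http://www.99acres.com/property-in-{}-chennai-south-ffid"
LOCALITIES = ("velachery", "thoraipakkam", "madipakkam")


def build_links(localities, template=URL_TEMPLATE):
    """Return one URL per locality, in the same order."""
    return [template.format(locality) for locality in localities]


def scrape_all(localities, fetch, handle):
    """Fetch each locality's page and pass its HTML to handle().

    fetch(url) must return the page's HTML as a string.
    handle(html) stands in for the parsing and printing done per URL.
    """
    results = []
    for link in build_links(localities):
        html = fetch(link)
        results.append(handle(html))
    return results


def fake_fetch(url):
    """Stand-in for a real download, so nothing is requested."""
    return "<html>" + url + "</html>"


# Here handle is just len, so pages holds the length of each fake page.
pages = scrape_all(LOCALITIES, fake_fetch, len)
```

With your existing setup, fetch would presumably be something like lambda url: urllib2.urlopen(url).read(), and handle would be a function wrapping your Soup(html) parsing and the print statements.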

Upvotes: 2
