Reputation: 87
I am trying to scrape a web page with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true")
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links]
for urls in hrefs:
    pages = requests.get(urls)
    soup_2 = BeautifulSoup(pages.content, 'html.parser')
    Date = soup_2.find_all('li', attrs={'class': 'sold-date'})
    Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
    Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
    Address = [Address.text.strip() for Address in Address_1]
The above code is returning only the details from the first URL in hrefs:
['Mon 05-Jun-17'] ['261 Keilor Road, Essendon, Vic 3040']
I need the loop to run through each URL in hrefs and return similar details from each one. Please suggest what I should add or edit in the above code. Any help would be highly appreciated.
Thanks
Upvotes: 0
Views: 1474
Reputation: 11009
You are overwriting the Address and Sold_Date objects on each iteration:
# after assignment previous data will be lost
Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
Address = [Address.text.strip() for Address in Address_1]
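A minimal illustration of the difference, using toy data rather than the scraper itself:

```python
results = []
for batch in (['a'], ['b', 'c']):
    items = [s.upper() for s in batch]  # reassigned on each pass: the previous value is lost
    results += items                    # accumulating keeps the results from every pass

# After the loop, items holds only ['B', 'C'],
# while results holds ['A', 'B', 'C'].
```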
Try initializing empty lists outside the loop and extending them:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true")
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links]
addresses = []
sold_dates = []
for urls in hrefs:
    pages = requests.get(urls)
    soup_2 = BeautifulSoup(pages.content, 'html.parser')
    dates_tags = soup_2.find_all('li', attrs={'class': 'sold-date'})
    sold_dates += [date_tag.text.strip() for date_tag in dates_tags]
    addresses_tags = soup_2.find_all('p', attrs={'class': 'full-address'})
    addresses += [address_tag.text.strip() for address_tag in addresses_tags]
which gives us:
>>> sold_dates
[u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Tue 06-Jun-17',
u'Mon 05-Jun-17',
u'Mon 05-Jun-17',
u'Mon 05-Jun-17']
>>> addresses
[u'141 Napier Street, Essendon, Vic 3040',
u'5 Loupe Crescent, Leopold, Vic 3224',
u'80 Ryrie Street, Geelong, Vic 3220',
u'18 Boase Street, Brunswick, Vic 3056',
u'130-186 Buckley Street, West Footscray, Vic 3012',
u'223 Park Street, South Melbourne, Vic 3205',
u'48-50 The Centreway, Lara, Vic 3212',
u'14 Webster Street, Ballarat, Vic 3350',
u'323 Nepean Highway, Frankston, Vic 3199',
u'341 Buckley Street, Aberfeldie, Vic 3040']
Upvotes: 1
Reputation: 482
The code is behaving as written: each pass through the loop overwrites the previous results. You need to store the information in an external list that persists across iterations.
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true")
soup = BeautifulSoup(page.content, 'html.parser')
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links]
Data = []
for urls in hrefs:
    pages = requests.get(urls)
    soup_2 = BeautifulSoup(pages.content, 'html.parser')
    Date = soup_2.find_all('li', attrs={'class': 'sold-date'})
    Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
    Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
    Address = [Address.text.strip() for Address in Address_1]
    Data.append(Sold_Date + Address)
print(Data)  # a bare return is only valid inside a function
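Note that Sold_Date + Address concatenates the two lists into one flat row. If you instead want each date paired with its matching address, zip keeps them aligned, assuming both lists come back in the same order (a sketch with hypothetical per-page data):

```python
# Hypothetical per-page lists; zip pairs them by position
sold_dates = ['Mon 05-Jun-17', 'Tue 06-Jun-17']
addresses = ['261 Keilor Road, Essendon, Vic 3040',
             '141 Napier Street, Essendon, Vic 3040']
records = list(zip(sold_dates, addresses))
# records[0] == ('Mon 05-Jun-17', '261 Keilor Road, Essendon, Vic 3040')
```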
Upvotes: 1