Max.Hen

Reputation: 1

How to pull the same nested data from a list of URLs using BeautifulSoup

Good Afternoon,

I'm relatively new to scraping and I'm currently stuck on this one project. The data to be pulled is the company name, address, phone number, and company URL (all pulled from the nested web page).

Main Page: http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1
Nested Page: http://www.therentalshow.com/exhibitor-detail/cid/45794/exhib/2019

I was able to compile this list of URLs, but I'm having the hardest time scraping each company's information and outputting it to a CSV in table format.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
import csv, os

my_url = 'http://www.therentalshow.com/find-exhibitors/sb-search/equipment/sb-inst/8678/sb-logid/242109-dcja1tszmylg308y/sb-page/1'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, 'lxml')

#create list of urls from main page
urls = []
tags = page_soup.find_all('a',{'class':'avtsb_title'})
for tag in tags:
    urls.append('http://www.therentalshow.com' + tag.get('href'))

#iterate through each page to return company data
for url in urls:
    site = uReq(url)
    soups = soup(site, 'lxml')

    name = page_soup.select('h2')
    address = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblAddress'})
    city = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip'})
    phone = page_soup.find('span',{'id':'dnn_ctr8700_TRSExhibitorDetail_lblPhone'})
    website = page_soup.find('a',{'id':'dnn_ctr8700_TRSExhibitorDetail_hlURL'})

    os.getcwd()
    outputFile = open('output2.csv', 'a', newline='')
    outputWriter = csv.writer(outputFile)
    outputWriter.writerow([name, address, city, phone, website])

My returned output is

[],,,,
[],,,,

99 lines in total. My total list of links is 100.

I would like the names of the aforementioned variables as headers in my CSV file, but my current output is not what I'm looking for. I'm quite lost, so ANY help at all would be greatly appreciated. Thank you!

Upvotes: 0

Views: 215

Answers (1)

QHarr

Reputation: 84475

I can't fully test at present as requests is hanging, but you need to extract the .text of the returned elements. Also, your first selection (select('h2')) returns a list, so change it to select_one (or index into the list appropriately). I prefer CSS selectors over find.

I extracted the html from one page into an html variable:

page_soup = soup(html, 'lxml')  # 'soup' is the BeautifulSoup alias from the question's imports
name = page_soup.select_one('h2').text
address = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblAddress').text
city = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblCityStateZip').text
phone = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_lblPhone').text
website = page_soup.select_one('#dnn_ctr8700_TRSExhibitorDetail_hlURL').text
print([name, address, city, phone, website])

Copying the html from the first two links with the above changes yields:

['A-1 Scaffold Manufacturing', '590 Commerce Pkwy', 'Hays, KS', '785-621-5121', 'www.a1scaffoldmfg.com']
['Accella Tire Fill Systems', '2003 Curtain Pole Rd', 'Chattanooga, TN', '423-697-0400', 'www.accellatirefill.com']

Upvotes: 1
