Reputation: 429
I am trying to extract golf course information from the "thegolfcourse.net" website. I aim to gather the name, address, and phone number of the 18,000+ golf courses in the United States listed there. I ran my script, but it does not produce all the data from the website: there are 18,000+ golf courses, yet I only get about 200+ entries downloaded. I don't know whether my loop is wrong or whether I am simply not reaching all the data. I also get blank rows mixed into my data, and I am wondering how to extract the data properly.
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(56):
    url = "http://www.thegolfcourses.net/page/{}?ls&location&orderby=title".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("article")
    for item in g_data2:
        try:
            name = item.contents[5].find_all("a")[0].text
            print name
        except:
            name = ''
        try:
            phone = item.contents[13].find_all("p", {"class": "listing-phone"})[0].text
        except:
            phone = ''
        try:
            address = item.contents[13].find_all("p", {"class": "listing-address"})[0].text
        except:
            address = ''
        course = [name, phone, address]
        courses_list.append(course)

with open('PGN.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
Upvotes: 0
Views: 94
Reputation: 4399
First, your code didn't grab everything because you set the range to 56, which is fine for testing, but to grab everything you need:

for i in range(1907):

The range goes up to 1907 because .format(i+1) is used when building the URL, so the last page requested is page 1907.
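If you'd rather not hardcode the page count, you could read it from the site's pagination links first. This is only a sketch and assumes the pager anchors use the common WordPress "page-numbers" class, so check the live HTML before relying on it:

import requests
from bs4 import BeautifulSoup

# Assumption: the pager links carry class="page-numbers" (typical WordPress markup).
first = requests.get("http://www.thegolfcourses.net/page/1?ls&location&orderby=title")
pager = BeautifulSoup(first.text).find_all("a", {"class": "page-numbers"})
page_numbers = [int(a.text) for a in pager if a.text.strip().isdigit()]
last_page = max(page_numbers) if page_numbers else 1
print last_page  # then use range(last_page) with .format(i+1) in the main loop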
Also, your for loops had several errors. These may have been introduced when you posted to Stack Overflow, but I cleaned them up anyway.
When I ran your code the first time I saw where the 'spacing' came from. You parsed the HTML looking for every article tag, but that tag also wraps the page header, the first "result" that displays "Listings found for "" near "" on your example link. You can narrow your scope when scraping with something like I did here:

g_data2 = soup.find_all("article", {"itemtype": "http://schema.org/Organization"})

This makes the scraping easier by only grabbing article tags whose itemtype attribute is "http://schema.org/Organization". That attribute is unique enough, and luckily every listing entry matches that format.
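To see why the extra attribute helps, here is a tiny self-contained check; the HTML is made up for illustration, and only the article carrying that itemtype survives the filter:

from bs4 import BeautifulSoup

html = """
<article class="search-header"><p>Listings found for "" near ""</p></article>
<article itemtype="http://schema.org/Organization"><h2 class="entry-title">Example GC</h2></article>
"""
soup = BeautifulSoup(html)
print len(soup.find_all("article"))                                                   # 2 -- header article included
print len(soup.find_all("article", {"itemtype": "http://schema.org/Organization"}))   # 1 -- listings only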
I also changed the file mode for your csv.writer from 'a' to 'wb', which starts a fresh CSV every time the script runs instead of appending to the old one.
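(Side note: 'wb' only works with the csv module on Python 2. If you ever move this to Python 3, open the file in text mode with newline='' and drop the manual .encode():)

with open('PGN.csv', 'w', newline='') as f:   # Python 3 equivalent of 'wb' for csv output
    writer = csv.writer(f)
    for row in courses_list:
        writer.writerow(row)                  # csv handles encoding itself in text mode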
Here's the final script:
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(1907):
    url = "http://www.thegolfcourses.net/page/{}?ls&location&orderby=title".format(i+1)
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    #print soup
    g_data2 = soup.find_all("article", {"itemtype": "http://schema.org/Organization"})
    for item in g_data2:
        try:
            name = item.find_all("h2", {'class': 'entry-title'})[0].text
            print name
        except:
            name = ''
            print "No Name found!"
        try:
            phone = item.find_all("p", {"class": "listing-phone"})[0].text
        except:
            phone = ''
            print "No Phone found!"
        try:
            address = item.find_all("p", {"class": "listing-address"})[0].text
        except:
            address = ''
            print "No Address found!"
        course = [name, phone, address]
        courses_list.append(course)

with open('PGN.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
Upvotes: 1