Gonzalo68

Reputation: 429

Data not being scraped properly from a website using Python

I am trying to extract golf course information from the "thegolfcourses.net" website. I aim to gather the name, address, and phone number of the 18,000+ golf courses in the United States listed on the site. When I run my script, it does not produce all the data: there are 18,000+ golf courses, but I only get about 200+ entries downloaded. I don't know whether my loop is wrong or whether my code isn't reaching all the data, and on top of that I get blank spaces in my data, so I am wondering how to extract it properly.

Here is my script:

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []

for i in range(56):
    url = "http://www.thegolfcourses.net/page/{}?ls&location&orderby=title".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    g_data2 = soup.find_all("article")

    for item in g_data2:
        try:
            name = item.contents[5].find_all("a")[0].text
            print name
        except:
            name = ''
        try:
            phone = item.contents[13].find_all("p", {"class": "listing-phone"})[0].text
        except:
            phone = ''
        try:
            address = item.contents[13].find_all("p", {"class": "listing-address"})[0].text
        except:
            address = ''

        course = [name, phone, address]
        courses_list.append(course)

with open('PGN.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])

Upvotes: 0

Views: 94

Answers (1)

serk

Reputation: 4399

First, your code didn't grab everything because you set the range to 56, which is fine for testing, but to grab everything you need to set

for i in range(1907):

The range is 1907 because the URL uses .format(i+1), so the loop requests pages 1 through 1907 (range(1907) yields i = 0 through 1906).
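
If you want to sanity-check that pagination before scraping, here is a minimal sketch; it only builds the same URLs your script requests, nothing here touches the live site:

pages = ["http://www.thegolfcourses.net/page/{}?ls&location&orderby=title".format(i + 1)
         for i in range(1907)]
print pages[0]   # http://www.thegolfcourses.net/page/1?ls&location&orderby=title
print pages[-1]  # http://www.thegolfcourses.net/page/1907?ls&location&orderby=title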

Also, your for loops had several indentation errors. These may have been introduced when you posted to Stack Overflow, but I cleaned them up anyway.

When I ran your code the first time, I saw where the 'spacing' was coming from. You parsed the HTML looking for the article tag, but that tag also matches the first search-result block, which displays the 'Listings found for "" near ""' header on your example link, so it produces an empty row. You can narrow your scope when web scraping by using something like this:

g_data2 = soup.find_all("article",{"itemtype":"http://schema.org/Organization"})

This makes the scraping easier by grabbing only the article tags whose itemtype is "http://schema.org/Organization". That attribute is unique enough, and luckily every listing matches that format.
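
As a quick illustration of what that filter does, here is a minimal sketch using a made-up HTML fragment (the class name and club name are invented for the example, not taken from the site):

from bs4 import BeautifulSoup

html = """
<article class="search-header">Listings found for "" near ""</article>
<article itemtype="http://schema.org/Organization">
  <h2 class="entry-title"><a href="#">Sample Golf Club</a></h2>
</article>
"""
soup = BeautifulSoup(html)
print len(soup.find_all("article"))  # 2 -- includes the header article
print len(soup.find_all("article", {"itemtype": "http://schema.org/Organization"}))  # 1 -- listings only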

I also changed the file mode in your csv writer from 'a' to 'wb', which starts a new CSV every time the script is run instead of appending to the old one.

Here's the final script:

import csv
import requests
from bs4 import BeautifulSoup

courses_list = []

for i in range(1907):
    url="http://www.thegolfcourses.net/page/{}?ls&location&orderby=title".format(i+1)
    r = requests.get(url)
    soup = BeautifulSoup(r.text)
    #print soup
    g_data2 = soup.find_all("article",{"itemtype":"http://schema.org/Organization"})

    for item in g_data2:
        try:
            name = item.find_all("h2",{'class':'entry-title'})[0].text
            print name
        except:
            name=''
            print "No Name found!"
        try:
            phone= item.find_all("p",{"class":"listing-phone"})[0].text
        except:
            phone=''
            print "No Phone found!"
        try:
            address= item.find_all("p",{"class":"listing-address"})[0].text
        except:
            address=''
            print "No Address found!"
        course=[name,phone,address]
        courses_list.append(course)

with open ('PGN.csv','wb') as file:
    writer=csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])

Upvotes: 1
