Kelli Humbird

Reputation: 65

Looping through a list of urls for web scraping with BeautifulSoup

I want to extract some information off websites with URLs of the form: http://www.pedigreequery.com/american+pharoah where "american+pharoah" is the extension for one of many horse names. I have a list of the horse names I'm searching for, I just need to figure out how to plug the names in after "http://www.pedigreequery.com/"

This is what I currently have:

import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)

import requests 
from bs4 import BeautifulSoup
for i in rows:      # Number of pages plus one 
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)

When I print out the URL, it doesn't have the horse's name at the end, just the base URL in quotes. The letters/print statements at the end are just to check whether it's actually reaching the website. This is how I've seen it done for looping over URLs that change by a number at the end; I haven't found advice on URLs that change by characters.

Thanks!

Upvotes: 2

Views: 9335

Answers (1)

Padraic Cunningham

Reputation: 180542

You are missing the placeholder in your format string, so change the format to:

url = "http://www.pedigreequery.com/{}".format(i)
                                     ^
                                   #add placeholder

Also, rows = list(allhorses) gives you a list of lists at best, so you would be passing a list, not a string/horse name. If you have one horse per line, just open the file normally and iterate over the file object, stripping the newline.
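To see why, here is a quick sketch (using an in-memory file with made-up names) of what csv.reader returns and what formatting a whole row produces:

```python
import csv
import io

# hypothetical two-line file, one horse name per line
data = io.StringIO("american+pharoah\nsecretariat\n")
rows = list(csv.reader(data))
print(rows)  # [['american+pharoah'], ['secretariat']]

# formatting a row (a list) puts the list's repr, brackets and all, into the URL
url = "http://www.pedigreequery.com/{}".format(rows[0])
print(url)  # http://www.pedigreequery.com/['american+pharoah']
```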

Presuming one horse name per line, the whole working code would be:

import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for horse in map(str.strip, f):   # strip the trailing newline from each line
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        letters = soup.find_all("a", class_="horseName")
        print(letters)

If you have multiple horses per line, you can use the csv lib, but you will need an inner loop:

import csv

with open("HORSES.csv") as f:
    for row in csv.reader(f):
        for horse in row:   # inner loop over the names in each row
            url = "http://www.pedigreequery.com/{}".format(horse)
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")
            letters = soup.find_all("a", class_="horseName")
            print(letters)

Lastly, if you don't have the names stored correctly, you have a few options, the simplest of which is to split the name and create the query string manually:

url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
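For example, with a hypothetical name stored with a space (the standard library's urllib.parse.quote_plus makes the same space-to-plus substitution, and also escapes other unsafe characters):

```python
from urllib.parse import quote_plus

horse = "American Pharoah"  # hypothetical name as it might be stored
url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
print(url)  # http://www.pedigreequery.com/American+Pharoah

# quote_plus gives the same result here
print(quote_plus(horse))  # American+Pharoah
```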

Upvotes: 2
