alxlvt

Reputation: 675

Iterating through multiple URLs from .txt file with Python/BeautifulSoup

I'm trying to write a script that takes a .txt file containing one YouTube username per line, appends each username to the YouTube user homepage URL, and crawls the resulting pages for profile data.

The code below gets me the info I want for a single user, but I have no idea where to start when it comes to reading in and iterating over multiple URLs.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2

# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()

# create a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find the profile info & display it
profileinfo = soup.find_all("div", {"class": "user-profile-item"})
for info in profileinfo:
    print info.get_text()

Does anyone have any recommendations?

E.g., if I had a .txt file that read:

username1
username2
username3
etc.

How could I go about iterating through those, appending them to http://youtube.com/user/%s, and creating a loop to pull all the info?

Upvotes: 1

Views: 3180

Answers (2)

Jeff Tratner

Reputation: 17076

If you don't want to use an actual scraping module (like scrapy, mechanize, selenium, etc.), you can just keep iterating on what you've written.

  1. Iterate over the file object to read line by line. A handy fact about file objects is that they use readline() as their iterator, so you can just do for line in file_obj to go line by line through a document.
  2. Concatenate URLs. I originally used +, but % string formatting (as in the code below) is cleaner.
  3. Make a list of URLs. This lets you stagger your requests, so you can do compassionate (rate-limited) screen scraping; see the sketch after this answer's code for one way to do that.

    # Goal: make a list of urls
    url_list = []

    # use a try-finally to make sure you close your file.
    try:
        f = open('pathtofile.txt', 'rb')
        for line in f:
            # strip the trailing newline before building the url
            username = line.strip()
            url_list.append('http://youtube.com/user/%s' % username)
        # do something with url_list (like call a scraper, or use urllib2)
    finally:
        f.close()


EDIT: Andrew G's string format is clearer. :)
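
To flesh out point 3, here is a minimal sketch of the staggered-requests idea, reusing the url_list built above; the one-second pause and the "html.parser" choice are arbitrary assumptions, not requirements:

    import time
    import urllib2
    from bs4 import BeautifulSoup

    for url in url_list:
        # fetch one page, scrape it, then pause before the next request
        response = urllib2.urlopen(url)
        soup = BeautifulSoup(response.read(), "html.parser")
        for info in soup.find_all("div", {"class": "user-profile-item"}):
            print info.get_text()
        time.sleep(1)  # crude rate limit; tune to whatever the site tolerates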

Upvotes: 2

Andrew Gorcester

Reputation: 19973

You'll need to open the file (preferably with the with open('/path/to/file', 'r') as f: syntax) and then call f.readline() in a loop. Assign the result of readline() to a string like "username" (remember to strip the trailing newline), then run your current code inside the loop, starting with response = urllib2.urlopen("http://youtube.com/user/%s" % username).
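
A minimal sketch of that approach, assuming one username per line in the file and using for line in f (which does the same thing as calling readline() repeatedly, a bit more idiomatically):

    from bs4 import BeautifulSoup
    import urllib2

    with open('/path/to/file', 'r') as f:
        for line in f:
            username = line.strip()  # drop the trailing newline
            if not username:
                continue  # skip blank lines
            response = urllib2.urlopen("http://youtube.com/user/%s" % username)
            soup = BeautifulSoup(response.read(), "html.parser")
            for info in soup.find_all("div", {"class": "user-profile-item"}):
                print info.get_text()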

Upvotes: 0
