alxlvt

Reputation: 675

Iterating through multiple URLs from .txt file with Python/BeautifulSoup

I'm trying to write a script that takes a .txt file containing one YouTube username per line, appends each username to the YouTube user homepage URL, and crawls the resulting pages for profile data.

The code below gets me the info I want for a single user, but I have no idea where to start when it comes to reading in and iterating over multiple URLs.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2

# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()

# create a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# find the profile info & display it
profileinfo = soup.find_all("div", {"class": "user-profile-item"})
for info in profileinfo:
    print info.get_text()

Does anyone have any recommendations?

E.g., if I had a .txt file that read:

username1
username2
username3
etc.

How could I go about iterating through those, appending them to http://youtube.com/user/%s, and creating a loop to pull all the info?

Upvotes: 1

Views: 3180

Answers (2)

Jeff Tratner

Reputation: 17076

If you don't want to use an actual scraping module (like scrapy, mechanize, selenium, etc.), you can just keep iterating on what you've written.

  1. Iterate over the file object to read line by line. A handy fact about file objects is that they use readline() as their iterator, so you can just do for line in file_obj to go line by line through a document.
  2. Concatenate URLs. I originally used +, but % string formatting (as in the code below) is cleaner.
  3. Make a list of URLs. This lets you stagger your requests, so you can do compassionate (rate-limited) screen scraping; see the sketch after this answer's code for one way to do that.

    # Goal: make a list of urls
    url_list = []

    # use a try-finally to make sure you close your file.
    try:
        f = open('pathtofile.txt', 'rb')
        for line in f:
            # strip the trailing newline before building the url
            username = line.strip()
            url_list.append('http://youtube.com/user/%s' % username)
        # do something with url_list (like call a scraper, or use urllib2)
    finally:
        f.close()


EDIT: Andrew G's string format is clearer. :)
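
To flesh out point 3, here is a minimal sketch of the staggered-requests idea, reusing the url_list built above; the one-second pause and the "html.parser" choice are arbitrary assumptions, not requirements:

    import time
    import urllib2
    from bs4 import BeautifulSoup

    for url in url_list:
        # fetch one page, scrape it, then pause before the next request
        response = urllib2.urlopen(url)
        soup = BeautifulSoup(response.read(), "html.parser")
        for info in soup.find_all("div", {"class": "user-profile-item"}):
            print info.get_text()
        time.sleep(1)  # crude rate limit; tune to whatever the site tolerates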

Upvotes: 2

Andrew Gorcester

Reputation: 19973

You'll need to open the file (preferably with the with open('/path/to/file', 'r') as f: syntax) and then call f.readline() in a loop. Assign the result of readline() to a string like "username" (remember to strip the trailing newline), then run your current code inside the loop, starting with response = urllib2.urlopen("http://youtube.com/user/%s" % username).
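
A minimal sketch of that approach, assuming one username per line in the file and using for line in f (which does the same thing as calling readline() repeatedly, a bit more idiomatically):

    from bs4 import BeautifulSoup
    import urllib2

    with open('/path/to/file', 'r') as f:
        for line in f:
            username = line.strip()  # drop the trailing newline
            if not username:
                continue  # skip blank lines
            response = urllib2.urlopen("http://youtube.com/user/%s" % username)
            soup = BeautifulSoup(response.read(), "html.parser")
            for info in soup.find_all("div", {"class": "user-profile-item"}):
                print info.get_text()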

Upvotes: 0
