Reputation: 675
I'm trying to create a script that takes a .txt file with multiple lines of YouTube usernames, appends each one to the YouTube user homepage URL, and crawls through the resulting pages to get profile data.
The code below gives me the info I want for one user, but I have no idea where to start for importing and iterating through multiple URLs.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib2
# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()
# create a beautiful soup object
soup = BeautifulSoup(html)
# find the profile info & display it
profileinfo = soup.findAll("div", { "class" : "user-profile-item" })
for info in profileinfo:
    print info.get_text()
Does anyone have any recommendations?
E.g., if I had a .txt file that read:
username1
username2
username3
etc.
How could I go about iterating through those, appending them to http://youtube.com/user/%s, and creating a loop to pull all the info?
Upvotes: 1
Views: 3180
Reputation: 17076
If you don't want to use an actual scraping module (like scrapy, mechanize, selenium, etc), you can just keep iterating on what you've written.
Use
    for line in file_obj:
to go line by line in a document. I used string formatting to build each URL below, but you can also concatenate with +. Making a list of URLs first lets you stagger your requests, so you can do compassionate screen scraping.
# Goal: make a list of urls
url_list = []
# open the file first, then use try-finally to make sure it gets closed.
f = open('pathtofile.txt', 'rb')
try:
    for line in f:
        username = line.strip()  # drop the trailing newline
        if username:  # skip blank lines
            url_list.append('http://youtube.com/user/%s' % username)
    # do something with url list (like call a scraper, or use urllib2)
finally:
    f.close()
EDIT: Andrew G's string format is clearer. :)
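To illustrate what staggering the requests could look like, here is a minimal sketch. The names fetch_all, fetch, delay_seconds, and sleep are hypothetical and not part of the answer above, and the two-second default pause is an arbitrary assumption. fetch stands in for whatever downloads one page, for example a small function that wraps urllib2.urlopen(url).read().

```python
import time


def fetch_all(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    Returns a list of (url, result) pairs in the same order as urls.
    fetch_all, fetch, delay_seconds and sleep are illustrative names only.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append((url, fetch(url)))
    return results
```

With the url_list built above, this could be called as fetch_all(url_list, my_fetch), where my_fetch is your own download function. Passing sleep as a parameter is only a convenience so the pause can be swapped out when trying the function without waiting.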
Upvotes: 2
Reputation: 19973
You'll need to open the file (preferably with the with open('/path/to/file', 'r') as f: syntax) and then call f.readline() in a loop, stopping when it returns an empty string at the end of the file. Assign each result to a string like "username", call username.strip() on it to remove the trailing newline, and then run your current code inside the loop, starting with response = urllib2.urlopen("http://youtube.com/user/%s" % username).
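As a rough sketch of the file-reading part, assuming one username per line: the helper names read_usernames and profile_url are made up for illustration. This version iterates over the file object directly rather than calling readline() by hand; it reads the same lines and stops at the end of the file on its own.

```python
def read_usernames(path):
    """Yield each non-empty username from the file at path.

    read_usernames is a hypothetical helper; it assumes one username per line.
    """
    with open(path, "r") as f:
        for line in f:
            username = line.strip()  # remove the newline and stray spaces
            if username:  # ignore blank lines
                yield username


def profile_url(username):
    """Build the profile URL using the pattern from the question."""
    return "http://youtube.com/user/%s" % username
```

Inside a loop such as for username in read_usernames('/path/to/file'):, the existing download and BeautifulSoup code from the question can run unchanged, with profile_url(username) passed to urllib2.urlopen.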
Upvotes: 0