Reputation: 49
I have a list of URLs I want to scrape.
The code I have works if I use each URL by itself; however, when I store the URLs in a file and loop over them, it only gets through the second URL and stops at the third.
This is my code:
urls = open("file.txt")
url = urls.read()
main = url.split("\n")
url_number = 0
while url_number < len(main):
    page = requests.get(main[url_number])
    tree = html.fromstring(page.text)
    tournament = tree.xpath('//title/text()')
    round1 = tree.xpath('//div[@data-round]/span/text()')
    scoreup = tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown = tree.xpath('//div[contains(@class, "bottom_score")]/text()')
    url_number = url_number + 1
    print url_number
    print "\n"
    results = []
    score_number = 0
    round_number = 0
    match_number = 0
    while round_number < len(round1):
        match_number += 1
        results.append(
            [match_number,
             round1[round_number],
             scoreup[score_number],
             round1[round_number + 1],
             scoredown[score_number],
             tournament, ])
        round_number = round_number + 2
        score_number = score_number + 1
    print results
This code gives me results up to the 2nd URL; for the 3rd one it prints only 3 (the url_number), followed by this error:

    scoredown[score_number],
IndexError: list index out of range
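For context, this failure mode appears whenever one of the scraped lists is shorter than the loop assumes; a minimal reproduction with made-up data:

```python
# Hypothetical data: round1 covers 2 matches, but the score lists
# only cover 1, as can happen when an xpath selector matches fewer
# elements on one page than on another.
round1 = ['Alice', 'Bob', 'Carol', 'Dave']
scoreup = ['2']
scoredown = ['0']

results = []
score_number = 0
round_number = 0
try:
    while round_number < len(round1):
        results.append([round1[round_number], scoreup[score_number],
                        round1[round_number + 1], scoredown[score_number]])
        round_number += 2
        score_number += 1
except IndexError as exc:
    # second pass: score_number == 1, but scoreup only has index 0
    print("IndexError: %s" % exc)
```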
Upvotes: 0
Views: 251
Reputation: 56634
I've rewritten this to take advantage of Python iterators instead of indexed loops:
from itertools import count, cycle, islice, izip

import lxml.html as lh
import requests

URL_FILE = "file.txt"
ROW_FORMAT = "{:>5} {:18} {:>5} {:18} {:>5} {}".format
HEADER = ['Match', 'Up', 'Score', 'Down', 'Score', 'Tournament']

lookfor = {
    'tournament': '//title/text()',
    'round1': '//div[@data-round]/span/text()',
    'scoreup': '//div[contains(@class, "top_score")]/text()',
    'scoredown': '//div[contains(@class, "bottom_score")]/text()'
}

def main():
    with open(URL_FILE) as inf:
        urls = (line.strip() for line in inf)
        for num, url in enumerate(urls, 1):
            # get data
            txt = requests.get(url).text
            tree = lh.fromstring(txt)
            # pull out the bits we want
            scraped = {name: tree.xpath(path) for name, path in lookfor.items()}
            # (and fix the title)
            title = scraped['tournament']
            tournament = title[0].replace('\n', '') if title else ''
            # reslice the data so it lines up
            match_num = count(1)                                # 1, 2, 3, ...
            up_rounds = islice(scraped['round1'], 0, None, 2)   # even rounds
            down_rounds = islice(scraped['round1'], 1, None, 2) # odd rounds
            tournament = cycle([tournament])                    # repeats the name
            # ... and reassemble it
            results = izip(match_num, up_rounds, scraped['scoreup'], down_rounds, scraped['scoredown'], tournament)
            # generate output
            print("\nRound {}:".format(num))
            print(ROW_FORMAT(*HEADER))
            for row in results:
                print(ROW_FORMAT(*row))

if __name__ == "__main__":
    main()
which, on the given url, results in:
Round 1:
Match Up Score Down Score Tournament
1 Halcyon.680 2 Dubrick.528 0 IG Spring 2011 Omega Divisional #11 - Challonge
2 rabidsnowman.208 0 Drunkenboi.856 2 IG Spring 2011 Omega Divisional #11 - Challonge
3 GoSuRum.612 2 Halcyon.680 1 IG Spring 2011 Omega Divisional #11 - Challonge
4 Hummingbird.656 1 hammy.161 2 IG Spring 2011 Omega Divisional #11 - Challonge
5 Cryptic.528 2 Divination.275 0 IG Spring 2011 Omega Divisional #11 - Challonge
6 tqyrusecc.243 2 Kodak.775 1 IG Spring 2011 Omega Divisional #11 - Challonge
7 coLrsvp.138 0 Drunkenboi.856 1 IG Spring 2011 Omega Divisional #11 - Challonge
8 Sharo.803 0 ices.813 2 IG Spring 2011 Omega Divisional #11 - Challonge
9 vpchance.970 2 hayes.848 0 IG Spring 2011 Omega Divisional #11 - Challonge
10 Leif.812 0 Amalaxinaoum.405 0 IG Spring 2011 Omega Divisional #11 - Challonge
11 GoSuRum.612 0 hammy.161 2 IG Spring 2011 Omega Divisional #11 - Challonge
12 Cryptic.528 0 tqyrusecc.243 2 IG Spring 2011 Omega Divisional #11 - Challonge
13 Drunkenboi.856 1 ices.813 2 IG Spring 2011 Omega Divisional #11 - Challonge
14 vpchance.970 2 Amalaxinaoum.405 0 IG Spring 2011 Omega Divisional #11 - Challonge
15 hammy.161 0 tqyrusecc.243 2 IG Spring 2011 Omega Divisional #11 - Challonge
16 ices.813 1 vpchance.970 2 IG Spring 2011 Omega Divisional #11 - Challonge
17 tqyrusecc.243 0 vpchance.970 3 IG Spring 2011 Omega Divisional #11 - Challonge
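The reslicing step is the heart of this answer; here is the same even/odd split in isolation, written for Python 3 (where itertools.izip became the builtin zip) with made-up round data:

```python
from itertools import count, islice

# Made-up flat list: "up" and "down" players alternate,
# just like the scraped round1 list.
round1 = ['Halcyon', 'Dubrick', 'rabidsnowman', 'Drunkenboi']
scoreup = ['2', '0']
scoredown = ['0', '2']

up_rounds = islice(round1, 0, None, 2)    # even indices: the "up" players
down_rounds = islice(round1, 1, None, 2)  # odd indices: the "down" players

# zip stops at the shortest iterable, so mismatched list lengths
# can never raise an IndexError here.
rows = list(zip(count(1), up_rounds, scoreup, down_rounds, scoredown))
print(rows)
# [(1, 'Halcyon', '2', 'Dubrick', '0'),
#  (2, 'rabidsnowman', '0', 'Drunkenboi', '2')]
```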
Upvotes: 1
Reputation: 21914
I've got a few suggestions for your code, which should help you avoid issues like this in the future. Mainly, there's no reason to be using all of those while loops; you'd be much better off looping through your lists explicitly with a for loop.
Also, Python has a lot of built-ins that do the dirty work for you. Together, these two transform the first part of your code into this:
for url in open('file.txt').readlines():
It's hard to be entirely sure without seeing the URLs you're scraping, but I'm willing to bet that the sizes of your returned lists aren't as consistent as you think they are. Your xpath selectors don't look particularly narrow, and since you're using a while loop instead of explicitly looping through your results, if the number of values in your round1, scoreup, or scoredown lists is slightly different from what you expect, you'll get errors like this.
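To make that concrete, the two while loops could be collapsed into one for loop over pre-sliced lists (a sketch with made-up data; since zip truncates to the shortest input, a short scoreup or scoredown list shortens the output instead of raising IndexError):

```python
# Made-up sample data standing in for the scraped lists.
round1 = ['Alice', 'Bob', 'Carol', 'Dave']
scoreup = ['2', '1']
scoredown = ['0', '2']
tournament = ['Some Tournament']

results = []
# round1[::2] are the "up" players, round1[1::2] the "down" players;
# enumerate(..., 1) replaces the hand-rolled match_number counter.
for match_number, (up, su, down, sd) in enumerate(
        zip(round1[::2], scoreup, round1[1::2], scoredown), 1):
    results.append([match_number, up, su, down, sd, tournament])

print(results)
```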
Upvotes: 1