blueinc

Reputation: 49

Python Requests Scraping

I have a list of URLs I want to scrape.

The code I have works when I run it on each URL individually; however, when I store the URLs in a file and loop over them, it processes the first two URLs and then fails on the third.

This is my code:

import requests
from lxml import html

urls = open("file.txt")
url = urls.read()
main = url.split("\n")

url_number=0
while url_number<len(main):
    page = requests.get(main[url_number])
    tree = html.fromstring(page.text)


    tournament = tree.xpath('//title/text()')

    round1= tree.xpath('//div[@data-round]/span/text()')
    scoreup= tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown= tree.xpath('//div[contains(@class, "bottom_score")]/text()')

    url_number=url_number+1
    print url_number

    print "\n"
    results = [] 
    score_number=0
    round_number=0
    match_number=0

    while round_number < len(round1):
        match_number +=1
        results.append(
                        [match_number,
                        round1[round_number],
                        scoreup[score_number],
                        round1[round_number+1],
                        scoredown[score_number],
                        tournament,])
        round_number=round_number+2
        score_number=score_number+1


    print results

This code gives me results for the first two URLs, but for the 3rd it prints only the url_number (3) and then raises this error:

scoredown[score_number],
IndexError: list index out of range
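For illustration, here is a minimal, self-contained reproduction (with made-up list lengths standing in for the scraped data) of how the loop runs past the end of the score list:

```python
round1    = ["p1", "p2", "p3", "p4", "p5", "p6"]  # 3 matches' worth of names
scoredown = [0, 2]                                # but only 2 scores scraped

score_number = 0
round_number = 0
try:
    while round_number < len(round1):             # loop length is driven by round1...
        row = (round1[round_number], scoredown[score_number])  # ...but indexes scoredown
        round_number += 2
        score_number += 1
except IndexError as e:
    print("IndexError:", e)  # raised on the 3rd pass, when score_number == 2
```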

Upvotes: 0

Views: 251

Answers (2)

Hugh Bothwell

Reputation: 56634

I've rewritten this to take advantage of Python iterators instead of indexed loops:

from itertools import count, cycle, islice, izip
import lxml.html as lh
import requests

URL_FILE   = "file.txt"
ROW_FORMAT = "{:>5}     {:18} {:>5}     {:18} {:>5}     {}".format
HEADER     = ['Match', 'Up', 'Score', 'Down', 'Score', 'Tournament']

lookfor = {
    'tournament': '//title/text()',
    'round1':     '//div[@data-round]/span/text()',
    'scoreup':    '//div[contains(@class, "top_score")]/text()',
    'scoredown':  '//div[contains(@class, "bottom_score")]/text()'
}

def main():
    with open(URL_FILE) as inf:
        urls = (line.strip() for line in inf)
        for num,url in enumerate(urls, 1):
            # get data
            txt = requests.get(url).text
            tree = lh.fromstring(txt)

            # pull out the bits we want        
            scraped = {name:tree.xpath(path) for name,path in lookfor.items()}
            # (and fix the title)
            title = scraped['tournament']
            tournament = title[0].replace('\n', '') if title else ''

            # reslice the data so it lines up        
            match_num   = count(1)                               # 1, 2, 3, ...
            up_rounds   = islice(scraped['round1'], 0, None, 2)  # even rounds
            down_rounds = islice(scraped['round1'], 1, None, 2)  # odd rounds
            tournament  = cycle([tournament])      # repeats the name
            # ... and reassemble it        
            results = izip(match_num, up_rounds, scraped['scoreup'], down_rounds, scraped['scoredown'], tournament)

            # generate output
            print("\nRound {}:".format(num))

            print(ROW_FORMAT(*HEADER))
            for row in results:
                print(ROW_FORMAT(*row))

if __name__=="__main__":
    main()

which, on the given url, results in:

Round 1:
Match     Up                 Score     Down               Score     Tournament
    1     Halcyon.680            2     Dubrick.528            0     IG Spring 2011 Omega Divisional #11 - Challonge
    2     rabidsnowman.208       0     Drunkenboi.856         2     IG Spring 2011 Omega Divisional #11 - Challonge
    3     GoSuRum.612            2     Halcyon.680            1     IG Spring 2011 Omega Divisional #11 - Challonge
    4     Hummingbird.656        1     hammy.161              2     IG Spring 2011 Omega Divisional #11 - Challonge
    5     Cryptic.528            2     Divination.275         0     IG Spring 2011 Omega Divisional #11 - Challonge
    6     tqyrusecc.243          2     Kodak.775              1     IG Spring 2011 Omega Divisional #11 - Challonge
    7     coLrsvp.138            0     Drunkenboi.856         1     IG Spring 2011 Omega Divisional #11 - Challonge
    8     Sharo.803              0     ices.813               2     IG Spring 2011 Omega Divisional #11 - Challonge
    9     vpchance.970           2     hayes.848              0     IG Spring 2011 Omega Divisional #11 - Challonge
   10     Leif.812               0     Amalaxinaoum.405       0     IG Spring 2011 Omega Divisional #11 - Challonge
   11     GoSuRum.612            0     hammy.161              2     IG Spring 2011 Omega Divisional #11 - Challonge
   12     Cryptic.528            0     tqyrusecc.243          2     IG Spring 2011 Omega Divisional #11 - Challonge
   13     Drunkenboi.856         1     ices.813               2     IG Spring 2011 Omega Divisional #11 - Challonge
   14     vpchance.970           2     Amalaxinaoum.405       0     IG Spring 2011 Omega Divisional #11 - Challonge
   15     hammy.161              0     tqyrusecc.243          2     IG Spring 2011 Omega Divisional #11 - Challonge
   16     ices.813               1     vpchance.970           2     IG Spring 2011 Omega Divisional #11 - Challonge
   17     tqyrusecc.243          0     vpchance.970           3     IG Spring 2011 Omega Divisional #11 - Challonge

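The even/odd reslicing in the code above can be seen in miniature (with hypothetical player names):

```python
from itertools import islice

# alternating up/down entries, as in round1
rounds = ["Alice", "Bob", "Carol", "Dave"]

ups   = list(islice(rounds, 0, None, 2))  # even indices -> up players
downs = list(islice(rounds, 1, None, 2))  # odd indices  -> down players

print(ups)    # ['Alice', 'Carol']
print(downs)  # ['Bob', 'Dave']
```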
Upvotes: 1

Slater Victoroff

Reputation: 21914

I've got a few suggestions for your code that should help you avoid issues like this in the future. Mainly, there's no reason to use all of those while loops; you'd be better off iterating over your lists directly with a for loop.

Also, Python has a lot of built-ins that take care of the dirty work for you. Together, these two changes transform the first part of your code into this:

for url in open('file.txt').read().splitlines():

It's hard to be entirely sure without seeing the URLs you're scraping, but I'm willing to bet that the sizes of the returned lists aren't as consistent as you think they are.

Your XPath selectors don't look particularly narrow, and since you're using a while loop instead of iterating over your results directly, if the round1, scoreup, or scoredown lists come back with slightly different lengths than you expect, you'll get errors like this.
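As a sketch (with made-up lists of uneven length), a for loop over zip stops at the shortest input instead of raising an IndexError:

```python
round1    = ["p1", "p2", "p3", "p4", "p5"]  # odd length -- an extra, unpaired entry
scoreup   = [2, 0]
scoredown = [1, 2]

results = []
for match, (up, down, s_up, s_down) in enumerate(
        zip(round1[0::2], round1[1::2], scoreup, scoredown), 1):
    results.append([match, up, s_up, down, s_down])

print(results)  # zip stops at the shortest list, so no IndexError
```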

Upvotes: 1
