Tjeerd Tim

Reputation: 71

Requests code to scrape paginated websites

I'm trying to scrape several Wikipedia pages whose URLs are numbered by year:

for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

x = BeautifulSoup(source.text, "html.parser")

x

However, when I inspect x, it contains only the 1999 page. How can I scrape all the pages I need, for the years 1991 to 2000?

And how can I put them in a dict with the year as the key and the page text as the value?

Upvotes: 0

Views: 490

Answers (2)

Vikas Ojha

Reputation: 6950

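Because your x is outside the for loop, it is built only once, after the loop has finished, from the last response (1999). Move the parsing inside the loop and store each result in a dict: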

import requests
from bs4 import BeautifulSoup

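res_dict = {}  # maps each year to the text of its page
# Short range for a quick test; use range(1991, 2001) to cover 1991 through 2000.
for year in range(1991, 1994, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

    # Parse inside the loop so every response is processed, not only the last one.
    soup = BeautifulSoup(source.content, "html.parser")
    res_dict[year] = soup.text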
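
A slightly more defensive variant of the same idea might look like the sketch below. The names year_url, fetch_text and scrape_years, the injectable fetch parameter, and the 10-second timeout are illustrative assumptions, not part of the original answer; only requests.get, raise_for_status and BeautifulSoup's get_text are real library calls.

```python
BASE_URL = "https://en.wikipedia.org/wiki/"


def year_url(year):
    """Build the article URL for one year (hypothetical helper)."""
    return BASE_URL + str(year)


def fetch_text(url):
    """Download one page and return its text; needs requests and bs4 installed."""
    # Imported here so the other helpers can be used without these packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # raise on 4xx/5xx instead of storing an error page
    return BeautifulSoup(response.content, "html.parser").get_text()


def scrape_years(first, last, fetch=fetch_text):
    """Return {year: page text} for every year from first to last, inclusive."""
    return {year: fetch(year_url(year)) for year in range(first, last + 1)}
```

Calling scrape_years(1991, 2000) would then request ten pages and return a dict keyed by year. Passing a different fetch function makes it possible to try the loop logic without touching the network.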

Upvotes: 1

Remi Guan

Reputation: 22282

Because the for loop runs its body once per year and reassigns the same variables each time. Let's look at an example:

for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url) 

On the first iteration, url is https://en.wikipedia.org/wiki/1991. On the second, it is https://en.wikipedia.org/wiki/1992.

On the last iteration, url is https://en.wikipedia.org/wiki/1999, so when the loop ends, source is requests.get("https://en.wikipedia.org/wiki/1999"). Every earlier response has been overwritten.

If that isn't clear, try running this code:

for i in range(1, 10):
    a = i
    print(a)  # prints 1 through 9, one value per iteration

print(a)  # prints 9: after the loop, a holds only the last value
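
To make the contrast concrete, here is a small sketch that puts the overwritten variable next to a dict that keeps every value. The names last_seen and collected are made up for illustration.

```python
last_seen = None
collected = {}

for i in range(1, 10):
    last_seen = i          # overwritten on every pass
    collected[i] = i * i   # a new key on every pass, so nothing is lost

print(last_seen)        # 9, only the final value survives
print(len(collected))   # 9, one entry per iteration
```

The same applies to the scraping loop: a plain variable such as source keeps only the last page, while a dict keyed by year keeps them all.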

So x = BeautifulSoup(source.text, "html.parser") must be inside the for loop, like this:

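import requests
from bs4 import BeautifulSoup

pages = {}  # year -> page text, as the question asks
for year in range(1991, 2000, 1):  # stops at 1999; use 2001 to include 2000
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)

    # Parse each response before the next iteration overwrites source.
    x = BeautifulSoup(source.text, "html.parser")
    pages[year] = x.text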

Upvotes: 0
