Reputation: 71
I'm trying to scrape several numbered pages (in years) from Wikipedia:
for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)
x = BeautifulSoup(source.text, "html.parser")
x
However, when I inspect 'x' I see that I downloaded only the 1999 page. How can I scrape all the pages I need, for the years 1991 to 2000?
And how can I put them in a dict with each year as the key and the page text as the value?
Upvotes: 0
Views: 490
Reputation: 6950
Because your x is assigned outside the for loop, it is built only from the last response. Change your code to this:
import requests
from bs4 import BeautifulSoup

res_dict = {}
for year in range(1991, 1994, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)
    soup = BeautifulSoup(source.content, "html.parser")
    res_dict[year] = soup.text
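One thing to watch: range() excludes its end value, so range(1991, 1994) stops at 1993 and the question's range(1991, 2000) stops at 1999. Below is a minimal sketch that covers 1991 to 2000 inclusive. The names wikipedia_text and scrape_years, the fetch_text parameter, and the 10-second timeout are my own choices for illustration, not something from the question.

```python
def wikipedia_text(year):
    """Download one year's Wikipedia page and return its text."""
    # Third-party imports live here so the loop below can be tried offline.
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/" + str(year)
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser").get_text()


def scrape_years(first, last, fetch_text=wikipedia_text):
    """Return {year: text} for every year from first to last, inclusive."""
    res_dict = {}
    for year in range(first, last + 1):  # +1 because range() excludes its end
        res_dict[year] = fetch_text(year)  # stored inside the loop, one entry per year
    return res_dict
```

Calling scrape_years(1991, 2000) would make ten requests and return a dict with ten keys. Passing your own fetch_text function lets you check the loop logic without touching the network.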
Upvotes: 1
Reputation: 22282
Because for repeats its body once per iteration, and each pass overwrites the variables from the previous one. Let's look at an example:
for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)
The first time through the loop, url is https://en.wikipedia.org/wiki/1991.
The second time, url is https://en.wikipedia.org/wiki/1992.
The last time, url is https://en.wikipedia.org/wiki/1999. So after the loop ends, source is requests.get("https://en.wikipedia.org/wiki/1999").
If that is not clear, try running this code:
for i in range(1, 10):
    a = i
    print(a)
print(a)
So x = BeautifulSoup(source.text, "html.parser") must be inside the for loop, like this:
for year in range(1991, 2000, 1):
    url = "https://en.wikipedia.org/wiki/" + str(year)
    source = requests.get(url)
    x = BeautifulSoup(source.text, "html.parser")
    x
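Keep in mind that even with x inside the loop, x is rebound on every pass, so after the loop it still holds only the last page. To keep every page you have to store each result as you go. Here is a small offline sketch of the difference; fake_parse is a made-up stand-in for the requests and BeautifulSoup calls, so nothing is downloaded.

```python
def fake_parse(year):
    """Hypothetical stand-in for downloading and parsing one page."""
    return "page for " + str(year)


last_only = None
collected = {}
for year in range(1991, 2000, 1):
    last_only = fake_parse(year)   # overwritten on every iteration
    collected[year] = last_only    # kept under its own key

print(last_only)       # page for 1999
print(len(collected))  # 9
```

The single variable ends up with the 1999 result, while the dict holds one entry for each of the nine years the loop visited.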
Upvotes: 0