Reputation: 55
I am currently learning web scraping and Python. I want to write code that downloads a list of .xls data files based on a list of links that I have created. Each of these links downloads a data file that corresponds to FDI flows of a country.
My problem is that, with the way the code is currently written, the last URL in my list overwrites all the previous files. The files are named correctly but they all contain the data for the last country in the list. As an example, I am only taking the last three countries in the data.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
page = requests.get("https://unctad.org/en/Pages/DIAE/FDI%20Statistics/FDI-Statistics-Bilateral.aspx")
soup = BeautifulSoup(page.text, 'html.parser')
countries_list = soup.select('[id=FDIcountriesxls] option[value]')
links = [link.get('value') for link in countries_list[203:-1]] #sample of countries
countries = [country.text for country in countries_list[203:-1]] #sample of countries
links_complete = ["https://unctad.org" + link for link in links]
for link in links_complete:
    for country in countries:
        r = requests.get(link)
        with open(country + '.xls', 'wb') as file:
            file.write(r.content)
What this gets me is three files, named after the three countries but all containing the data for the last one (Zambia).
Can anyone help with this?
Thanks.
Upvotes: 0
Views: 94
Reputation: 2445
That's because of the double loop: for every link, the inner "countries" loop opens each file in write mode ('wb') and overwrites it, so once the last link has been processed, every file holds only the last country's data. You don't need a double loop at all. To solve your problem, loop over countries_list directly:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import os
page = requests.get("https://unctad.org/en/Pages/DIAE/FDI%20Statistics/FDI-Statistics-Bilateral.aspx")
soup = BeautifulSoup(page.text, 'html.parser')
countries_list = soup.select('[id=FDIcountriesxls] option[value]')
for opt in countries_list:
    value = opt.get('value')
    if value:
        link = "https://unctad.org" + value
        country = opt.get_text()
        r = requests.get(link)
        with open(country + '.xls', 'wb') as file:
            file.write(r.content)
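If you'd rather keep your two parallel lists (links_complete and countries), another option is to pair them with zip() so each country name is matched with exactly one link. This is a sketch, not your exact code; pair_downloads and download_all are hypothetical helper names introduced here for illustration:

```python
import requests

def pair_downloads(countries, links):
    """Pair each country name with its download link, in order."""
    return list(zip(countries, links))

def download_all(countries, links):
    # One request per (country, link) pair; each file is written once.
    for country, link in pair_downloads(countries, links):
        r = requests.get(link)
        with open(country + '.xls', 'wb') as f:
            f.write(r.content)
```

Note that this relies on the two lists being built from the same slice of countries_list in the same order, which your code already guarantees.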
Upvotes: 1