Reputation: 9
I am having trouble scraping a list of hits. For each year there is a hit list on its own webpage, and the URL of that page contains the year, so I'd like to create one CSV file per year containing that year's hit list.
Unfortunately I cannot get the loop over the years to work, and I get the following error:
ValueError: unknown url type: 'h'
Here is the code I am trying to use. I apologize if there are simple mistakes, but I'm a newbie in Python and I couldn't find a similar question on the forum to adapt to this case.
import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1947,2016))
for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(singoli + str(year), "w")
        headers = "lista"
        f.write(headers)
        lista = container.text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()
Upvotes: 0
Views: 274
Reputation: 22440
Try this. Hope it will solve the issue:
import csv
import urllib.request
from bs4 import BeautifulSoup

# All years are written to a single CSV file, one entry per row.
outfile = open("hitparade.csv", "w", newline='', encoding='utf8')
writer = csv.writer(outfile)
for year in range(1947, 2016):
    my_urls = urllib.request.urlopen('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm').read()
    soup = BeautifulSoup(my_urls, "lxml")
    # Remove <script> elements so their contents do not end up in the text.
    for scr in soup('script'):
        scr.extract()
    for container in soup.select(".li1,.liy,li"):
        writer.writerow([container.text.strip()])
        print("lista: " + container.text.strip())
outfile.close()
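The code above collects every year in one file, while the question asks for one CSV per year. A minimal sketch of that variant follows; it uses only the standard library, and the helper names year_url and write_year_csv are hypothetical (they are not part of the original answer). The scraped titles are passed in as a list, so the fetching and parsing step stays as shown above.

```python
import csv
import os

# Base URL taken from the question; the year and '.htm' are appended to it.
BASE_URL = 'http://www.hitparadeitalia.it/hp_yends/hpe'


def year_url(year):
    """Build the page URL for one year (hypothetical helper)."""
    return BASE_URL + str(year) + '.htm'


def write_year_csv(year, titles, out_dir='.'):
    """Write one CSV file for the given year, one title per row.

    Hypothetical helper: the file name follows the question's
    'singoli<year>.csv' pattern and 'lista' is used as the header.
    """
    path = os.path.join(out_dir, 'singoli' + str(year) + '.csv')
    with open(path, 'w', newline='', encoding='utf8') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(['lista'])
        for title in titles:
            writer.writerow([title.strip()])
    return path
```

Inside the loop over the years, something like write_year_csv(year, [c.text for c in soup.select("li")]) would then replace the writer.writerow calls.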
Upvotes: 0
Reputation: 10403
You think you are defining a tuple with ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'), but you have just defined a plain string.
So you are looping over a string, which means letter by letter, not URL by URL. That is why urlopen receives 'h' and raises the ValueError.
When you want to define a tuple with a single element, you have to make that explicit with a trailing comma, for example: ("foo",).
Fix:
my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm', )
From the Python tutorial: "A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective."
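To see the difference concretely, here is a small sketch that makes no network request. The variable names are only for illustration; the URL pieces are the ones from the question.

```python
base = 'http://www.hitparadeitalia.it/hp_yends/hpe'
year = 1947

# Parentheses alone only group the expression: this is a plain str.
not_a_tuple = (base + str(year) + '.htm')

# The trailing comma is what makes this a one-element tuple.
one_tuple = (base + str(year) + '.htm',)

# Iterating a str yields single characters; iterating the tuple yields the URL.
first_from_string = [item for item in not_a_tuple][0]
first_from_tuple = [item for item in one_tuple][0]

print(type(not_a_tuple).__name__, repr(first_from_string))
print(type(one_tuple).__name__, repr(first_from_tuple))
```

The first item taken from the string is the single character 'h', which is exactly what urlopen complains about in the error message.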
Upvotes: 1