Automating the scraping of multiple pages with a known URL scheme

I am having trouble scraping a series of hit lists. Each year has its own hit list on a separate web page, and the URL contains the year, so I'd like to loop over the years and write one CSV file per year.

Unfortunately, I cannot get the loop over the years to work, and I get the following error:

ValueError: unknown url type: 'h'

Here is the code I am trying to use. I apologize if there are simple mistakes, but I'm a newbie in Python and I couldn't find anything on the forum that I could adapt to this case.

import urllib
import urllib.request
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
years = list(range(1947,2016))

for year in years:
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')
    my_url = my_urls[0]
    for my_url in my_urls:
        uClient = uReq(my_url)
        html_input = uClient.read()
        uClient.close()
        page_soup = BeautifulSoup(html_input, "html.parser")
        container = page_soup.findAll("li")
        filename = "singoli" + str(year) + ".csv"
        f = open(singoli + str(year), "w")
        headers = "lista"
        f.write(headers)
        lista = container.text
        print("lista: " + lista)
        f.write(lista + "\n")
        f.close()

Upvotes: 0

Answers (2)

SIM

Reputation: 22440

Try this. It builds the URL for each year inside the loop and writes every list item from every year to a single CSV file:

import csv
import urllib.request
from bs4 import BeautifulSoup

# One output file for all years; each list item becomes one CSV row.
with open("hitparade.csv", "w", newline='', encoding='utf8') as outfile:
    writer = csv.writer(outfile)

    for year in range(1947, 2016):
        # The URL differs only by year, so build it inside the loop.
        url = 'http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        # Drop <script> elements so their contents do not end up in the text.
        for scr in soup('script'):
            scr.extract()
        for container in soup.select(".li1,.liy,li"):
            writer.writerow([container.text.strip()])
            print("lista: " + container.text.strip())
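
The code above puts every year into one hitparade.csv, while the question asks for one file per year. A minimal sketch of that variation follows, assuming the titles for a year have already been collected into a list of strings. The helper name write_year_csv and its directory parameter are hypothetical; the file name pattern and the "lista" header come from the question.

```python
import csv
from pathlib import Path


def write_year_csv(year, titles, directory="."):
    """Write the titles for one year to singoli<year>.csv and return the path."""
    path = Path(directory) / f"singoli{year}.csv"
    with open(path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(f)
        writer.writerow(["lista"])      # header row, as in the question
        for title in titles:
            writer.writerow([title])    # one title per row
    return path
```

Inside the loop over the years, this would be called once per year with the stripped text of each list item, instead of writing to a shared writer.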

Upvotes: 0

Arount

Reputation: 10403

You think you are defining a tuple with ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'), but parentheses alone do not make a tuple: you have defined a plain string.

So the for loop iterates over that string character by character, not URL by URL. The first value passed to urlopen is the single character 'h', which is why you get ValueError: unknown url type: 'h'.

To define a tuple with a single element, you have to make it explicit with a trailing comma, for example ("foo",).
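
The difference can be checked in a few lines. This is a minimal sketch; the example.com URL is a placeholder, not the address from the question.

```python
# Parentheses around a single value do not create a tuple.
not_a_tuple = ("http://example.com/page.htm")
# The trailing comma is what makes a one-element tuple.
one_tuple = ("http://example.com/page.htm",)

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Iterating over the string yields single characters.
first_from_string = next(iter(not_a_tuple))
# Iterating over the tuple yields the whole URL.
first_from_tuple = next(iter(one_tuple))

print(first_from_string)
print(first_from_tuple)
```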

Fix:

my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm', )

Reference (Python tutorial, "Tuples and Sequences"):

A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective.
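
Since there is only one URL per year, the inner loop over my_urls may not be needed at all. As a rough sketch of that alternative, the URL can be built directly from the year. The helper name year_url and the BASE constant are made up for illustration; the URL pattern is the one from the question.

```python
# URL prefix taken from the question; the year and ".htm" are appended to it.
BASE = "http://www.hitparadeitalia.it/hp_yends/hpe"


def year_url(year):
    """Return the hit list URL for a single year."""
    return f"{BASE}{year}.htm"


# Map each year to its URL, for the same range of years as in the question.
urls = {year: year_url(year) for year in range(1947, 2016)}

print(urls[1947])
```

Each value in urls is a complete URL string that can be passed to urlopen as is.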

Upvotes: 1
