user12092724

Scraping lists of items from Wikipedia

I need to get all the entries listed on this page:

https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana

from the section headed " through the letter Z.

The expected output is:

"
"900", Cahiers d'Italie et d'Europe
A
Abitare
Aerei
Aeronautica & Difesa
Airone (periodico)
Alp (periodico)
Alto Adige (quotidiano)
Altreconomia
....

In order to do this, I have tried using the following code:

import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Periodici_italiani_in_lingua_italiana")
soup = bs(res.text, "html.parser")
url_list = []

# Collect the href of every anchor on the page
links = soup.find_all('a')
for link in links:
    url = link.get("href", "")
    url_list.append(url)

lists_A = []

for url in url_list:
    lists_A.append(url)

print(lists_A)

However, this code collects more information than I need. In particular, the last item collected should be La Zanzara. Ideally, the items should not include the word in brackets, i.e. (rivista), (periodico), (settimanale), and so on, but just the title (e.g. Jack (periodico) should be just Jack).

Could you give me any advice on how to get this information? Thanks

Upvotes: 0

Views: 145

Answers (1)

NomadMonad

Reputation: 649

This will filter out some of the unwanted URLs (though not all): basically everything before "Corriere della Sera", which I'm assuming is the first expected URL.

import re
links = [a.get('href') for a in soup.find_all('a', {'title': True, 'href': re.compile('/wiki/(.*)'), 'accesskey': False})]

You can safely assume that all the magazine URLs are in order at this point. Since you know that "La Zanzara" should be the last expected URL, you can find the position of that string in the new list and slice up to that index + 1:

links.index('/wiki/La_zanzara_(periodico)')
Out[20]: 144

links = links[:145]
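
To avoid hard-coding the index, the lookup and the slice can be combined. This is only a sketch: `slice_through` is a hypothetical helper name, the sample list is made up for illustration, and falling back to the full list when the marker is missing is an assumption you may want to change.

```python
def slice_through(items, last_item):
    """Return items up to and including last_item."""
    try:
        end = items.index(last_item) + 1
    except ValueError:
        # Assumption: if the marker is absent, keep everything rather than fail.
        return list(items)
    return items[:end]


# Made-up sample data standing in for the scraped hrefs
sample = ['/wiki/Abitare', '/wiki/La_zanzara_(periodico)', '/wiki/Aiuto:Categorie']
trimmed = slice_through(sample, '/wiki/La_zanzara_(periodico)')
print(trimmed)
```

With the real data you would call it as `links = slice_through(links, '/wiki/La_zanzara_(periodico)')`.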

As for removing '(periodico)' and other data cleaning, you need to inspect your data and figure out what it is that you want to remove.

Write a simple function like this maybe:

def clean(string):
    to_remove = ['_(periodico)', '_(quotidiano)']
    # Strip every known suffix; the string is returned unchanged if none match
    for s in to_remove:
        string = string.replace(s, '')
    return string
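
If listing each suffix by hand becomes tedious, a regular expression can strip any trailing bracketed qualifier in one go. The sketch below uses a hypothetical helper, `href_to_title`, and assumes every href has the form `/wiki/Title_(qualifier)`, with underscores for spaces and percent-escapes for special characters. Check it against your real data before relying on it.

```python
import re
from urllib.parse import unquote


def href_to_title(href):
    """Turn '/wiki/Jack_(periodico)' into 'Jack'."""
    # Drop the '/wiki/' prefix and decode %xx escapes (e.g. %26 -> &)
    name = unquote(href.split('/wiki/', 1)[-1])
    # Wikipedia URLs use underscores in place of spaces
    name = name.replace('_', ' ')
    # Remove one trailing parenthetical such as ' (periodico)', if present
    return re.sub(r'\s*\([^()]*\)$', '', name)


print(href_to_title('/wiki/Jack_(periodico)'))
print(href_to_title('/wiki/Aeronautica_%26_Difesa'))
```

Note that this also removes brackets that are genuinely part of a title, so a hand-written list of suffixes may still be the safer choice for some entries.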

Upvotes: 1
