still_learning

Reputation: 806

Collecting information by scraping

I am trying to collect the names of Italian politicians by scraping Wikipedia. I need to scrape all the parties from this page: https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito, and then, for each party listed there, scrape the names of all the politicians within that party.

I wrote the following code:

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")
array1 = {}
possible_links = soup.find_all('a')
for link in possible_links:
    url = link.get("href", "")
    if "/wiki/Provenienza" in url: # It is incomplete, as I should also scrape links containing the words "Politici di/dei"
        res1=requests.get("https://it.wikipedia.org"+url)
        print("https://it.wikipedia.org"+url)
        soup = bs(res1, "html.parser")
        possible_links1 = soup.find_all('a')
        for link in possible_links1:
            url_1 = link.get("href", "")
            array1[link.text.strip()] = url_1

but it does not work: it collects all the parties from the Wikipedia page mentioned above, but when I try to scrape each party's page, it does not collect the names of the politicians within that party.

I hope you can help me.

Upvotes: 2

Views: 94

Answers (2)

Sileo
Sileo

Reputation: 295

EDIT: Please refer to QHarr's answer above.

So far I have scraped only the parties, nothing more. I'm sharing this code and will edit my answer once I get all the politicians.

from bs4 import BeautifulSoup as bs
import requests

res = requests.get("https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito")
soup = bs(res.text, "html.parser")
url_list = []
politicians_dict = {}

possible_links = soup.find_all('a')
for link in possible_links:
    url = link.get("href", "")
    if (("/wiki/Provenienza" in url) or ("/wiki/Categoria:Politici_d" in url)):
        full_url = "https://it.wikipedia.org"+url
        url_list.append(full_url)

for url in url_list:
    print(url)
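A possible sketch of the next step, pulling the names from each party's category page. The `.mw-category-group` selector is an assumption about the category page markup, and to stay self-contained the example runs on an inline HTML snippet shaped like a category page rather than the live site (the party and politician names below are just sample data):

```python
from bs4 import BeautifulSoup as bs

# Inline stand-in for one party's category page; the live pages
# appear to use the same .mw-category-group structure (assumption).
party_page_html = """
<div class="mw-category">
  <div class="mw-category-group">
    <h3>A</h3>
    <ul>
      <li><a href="/wiki/Giulio_Andreotti">Giulio Andreotti</a></li>
      <li><a href="/wiki/Amintore_Fanfani">Amintore Fanfani</a></li>
    </ul>
  </div>
</div>
"""

politicians_dict = {}

def extract_politicians(html):
    """Return the politician names listed on a category page."""
    soup = bs(html, "html.parser")
    return [a.text.strip() for a in soup.select(".mw-category-group a")]

# With a requests session this would be: extract_politicians(s.get(url).text)
politicians_dict["Democrazia Cristiana"] = extract_politicians(party_page_html)
print(politicians_dict)
```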

Upvotes: 1

QHarr

Reputation: 84465

You could collect the URLs and party names from the first page, then loop over those URLs and add the list of associated politician names to a dict keyed by the party name. You would gain efficiency by using a Session object and thereby re-using the underlying TCP connection.

from bs4 import BeautifulSoup as bs
import requests

results = {}

with requests.Session() as s: # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Categoria:Politici_italiani_per_partito')
    soup = bs(r.content, 'lxml')
    party_info = {i.text: 'https://it.wikipedia.org' + i['href'] for i in soup.select('.CategoryTreeItem a')} # dict of party name -> category link; href already starts with "/", so no trailing slash on the base URL

    for party, link in party_info.items():
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        results[party] = [i.text for i in soup.select('.mw-content-ltr .mw-content-ltr a')] # get politicians' names
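As a quick sanity check, the `party_info` comprehension can be exercised on an inline snippet that mimics the category tree markup (same class names as in the code above; the party names here are just sample data):

```python
from bs4 import BeautifulSoup as bs

# Stand-in for the category tree on the first page.
tree_html = """
<div class="CategoryTreeItem">
  <a href="/wiki/Categoria:Politici_di_Forza_Italia">Politici di Forza Italia</a>
</div>
<div class="CategoryTreeItem">
  <a href="/wiki/Categoria:Politici_del_Partito_Democratico">Politici del Partito Democratico</a>
</div>
"""

soup = bs(tree_html, "html.parser")
party_info = {i.text: "https://it.wikipedia.org" + i["href"]
              for i in soup.select(".CategoryTreeItem a")}
print(party_info)
```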

Upvotes: 4
