Harald

Reputation: 21

Scraping name list with varying numbers of last names

I am trying to scrape the Swedish members of parliament with Beautiful Soup. When I run the scraper I get "ValueError: too many values to unpack (expected 3)".

The script outputs a CSV, but with only five names. The sixth person on the list is named Alm Ericson, Janine (MP). I suppose the problem is that she has two last names (Alm Ericson), while the code expects exactly three values: last name, first name and party.

How should I code the field split so that it also works for double last names?

The names on the page are written as

Last_name, first_name (party)

Code:

import urllib.request
import bs4 as bs
import csv

source = urllib.request.urlopen("https://www.riksdagen.se/sv/ledamoter-partier/").read()
soup = bs.BeautifulSoup(source, "lxml")

data = []

for span in soup.find_all("span", {"class": "fellow-name"}):
    cleanednames = span.text.strip()
    data.append(cleanednames)  # fields are appended to a list rather than printed

with open("riksdagsledamoter.csv", "w") as stream:
    fieldnames = ["Last_Name","First_Name","Party"]
    var = csv.DictWriter(stream, fieldnames=fieldnames)
    var.writeheader()
    for item in data:
        last_name, First_name, party = item.split()  #splitting data in 3 fields
        last_name = last_name.replace(",","")  #removing ',' from last name
        party = party.replace("(","").replace(")","")  #removing "()" from party
        var.writerow({"Last_Name": last_name,"First_Name": First_name, "Party": party})  #writing to csv row

Upvotes: 1

Views: 108

Answers (3)

BlueSheepToken

Reputation: 6099

Here is a simple regex that should do the trick:

import re
print(re.match(r"(.*), (.*) \((.*)\)", 'Alm Ericson, Janine (MP)').groups())

Inspired by Corentin's answer.

Upvotes: 2

Corentin Limier

Reputation: 5006

Splitting on whitespace is not a good solution here, because a last name can itself contain spaces. (Alternatively, split on the comma and the parenthesis instead of on spaces.)

Using a regular expression:

import re
re.match(r'([^,]*), ([^(]*) \((.*)\)', 'Alm Ericson, Janine (MP)').groups()

Returns

('Alm Ericson', 'Janine', 'MP')
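To show how this could slot into the CSV loop from the question, here is a minimal sketch. It assumes every entry follows the "Last_name, first_name (party)" layout and simply skips anything that does not. The helper name parse_member, the named groups, and the second sample entry are made up for illustration, and an in-memory io.StringIO stands in for the real output file.

```python
import csv
import io
import re

# Same pattern as above, written as a raw string with named groups.
NAME_RE = re.compile(r"(?P<last>[^,]*), (?P<first>[^(]*) \((?P<party>.*)\)")

def parse_member(text):
    """Return (last_name, first_name, party), or None if the text does not match."""
    match = NAME_RE.fullmatch(text.strip())
    if match is None:
        return None
    return match.group("last"), match.group("first"), match.group("party")

# Sample input: the first entry is from the question, the second is invented.
data = ["Alm Ericson, Janine (MP)", "Svensson, Anna Maria (X)"]

# Stand-in for open("riksdagsledamoter.csv", "w", newline="")
stream = io.StringIO()
writer = csv.writer(stream)
writer.writerow(["Last_Name", "First_Name", "Party"])
for item in data:
    parts = parse_member(item)
    if parts is not None:  # skip entries that do not follow the expected layout
        writer.writerow(parts)

print(stream.getvalue())
```

Returning None for non-matching entries avoids the AttributeError that calling .groups() on a failed match would raise; whether to skip or log such entries is a judgment call for the real scraper.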

Upvotes: 4

QHarr

Reputation: 84465

I guess you could also use a function to return the parts in a list (not as clean as the answers already given), e.g.

def getParts(inputString):
    list1 = inputString.split(",")
    list2 = list1[1].split("(")
    finalList = [list1[0], list2[0].strip(),list2[1].replace(")","")]
    return finalList

inputString = 'Alm Ericson, Janine (MP)'

print(getParts(inputString))
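A slightly more compact variant of the same idea, as a sketch only: str.partition and str.rpartition split at the first comma and the last opening parenthesis. The function name split_member is hypothetical, and the code assumes the input really contains both a comma and a parenthesised party; it does no validation.

```python
def split_member(text):
    """Split 'Last, First (Party)' into three parts without a regex."""
    # Everything before the first comma is the last name (may contain spaces).
    last_name, _, rest = text.partition(",")
    # Everything after the last "(" is the party, still carrying the ")".
    first_name, _, party = rest.rpartition("(")
    return last_name.strip(), first_name.strip(), party.strip().rstrip(")")

print(split_member("Alm Ericson, Janine (MP)"))
```

If the "(" is missing, rpartition leaves the first name empty and puts the remainder in the party slot, so malformed entries would need a separate check.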

Upvotes: 0
