user8229029

Reputation: 1162

Parsing xml file using Python3 and BeautifulSoup

I know there are several answers to questions about XML parsing with Python 3, but I can't find answers to the two questions I have. I am trying to parse and extract information from a BoardGameGeek XML file, which is too long to paste here in full:

https://www.boardgamegeek.com/xmlapi/boardgame/10

1) I am having trouble extracting the primary game name from these two lines:

<name sortindex="1" primary="true">Elfenland</name>
<name sortindex="1">Elfenland (Волшебное Путешествие)</name>

2) I am also having trouble extracting lists of data, such as in this xml:

<poll title="User Suggested Number of Players" totalvotes="96"  name="suggested_numplayers">
    <results numplayers="1">
        <result numvotes="0" value="Best"/>
        <result numvotes="0" value="Recommended"/>
        <result numvotes="58" value="Not Recommended"/>
    </results>
    <results numplayers="2">
        <result numvotes="2" value="Best"/>
        <result numvotes="21" value="Recommended"/>
        <result numvotes="53" value="Not Recommended"/>
    </results>
    <results numplayers="3">
        <result numvotes="10" value="Best"/>
        <result numvotes="46" value="Recommended"/>
        <result numvotes="17" value="Not Recommended"/>
    </results>
    <results numplayers="4">
        <result numvotes="47" value="Best"/>
        <result numvotes="36" value="Recommended"/>
        <result numvotes="1" value="Not Recommended"/>
    </results>
    <results numplayers="5">
        <result numvotes="35" value="Best"/>
        <result numvotes="44" value="Recommended"/>
        <result numvotes="2" value="Not Recommended"/>
    </results>
    <results numplayers="6">
        <result numvotes="23" value="Best"/>
        <result numvotes="48" value="Recommended"/>
        <result numvotes="11" value="Not Recommended"/>
    </results>
    <results numplayers="6+">
        <result numvotes="0" value="Best"/>
        <result numvotes="1" value="Recommended"/>
        <result numvotes="46" value="Not Recommended"/>
    </results>
</poll>

My current code is very simple and looks like this. It only extracts elements that hold a single value. Any help on how to extract the more complex information would be great. Thank you.

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()       # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
yearpublished = soup.find_all('yearpublished')

Upvotes: 2

Views: 7394

Answers (1)

Dan-Dev

Reputation: 9440

For the first part, search for the "name" element that has the "primary" attribute, like this:

from bs4 import BeautifulSoup
import urllib.request  # plain `import urllib` is not guaranteed to load `urllib.request`

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()       # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
# primary=True matches a <name> that has a `primary` attribute, whatever its value
name = soup.find('name', primary=True)

print(name.get_text())

Outputs:

Elfenland
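
If installing BeautifulSoup and lxml is not an option, the same lookup can be sketched with the standard library's xml.etree.ElementTree instead. This is a swapped-in technique, not the approach above. The `<boardgame>` wrapper, the `SAMPLE` string and the `primary_name` helper are assumptions made up for illustration; only the two `<name>` lines come from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed sample: the <boardgame> wrapper is an assumption for illustration,
# the two <name> lines are copied from the question.
SAMPLE = """<boardgame>
  <name sortindex="1" primary="true">Elfenland</name>
  <name sortindex="1">Elfenland (Волшебное Путешествие)</name>
</boardgame>"""


def primary_name(xml_text):
    """Return the text of the <name> element with primary="true", or None."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath supports [@attribute='value'] predicates.
    node = root.find(".//name[@primary='true']")
    return node.text if node is not None else None


print(primary_name(SAMPLE))
```

Returning None when no primary name exists lets the caller decide how to handle a record without one, rather than failing on a missing element.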

For the second part, loop over the "results" elements and extract the data you want:

text = """
<poll title="User Suggested Number of Players" totalvotes="96"  name="suggested_numplayers">
    <results numplayers="1">
        <result numvotes="0" value="Best"/>
...
        <result numvotes="46" value="Not Recommended"/>
    </results>
</poll>
"""
soup = BeautifulSoup(text,'xml')

# Each <results> element holds the votes for one player count.
for results in soup.find_all('results'):
    numplayers = results['numplayers']
    best = results.find('result', {'value': 'Best'})['numvotes']
    recommended = results.find('result', {'value': 'Recommended'})['numvotes']
    not_recommended = results.find('result', {'value': 'Not Recommended'})['numvotes']
    print(numplayers, best, recommended, not_recommended)

Outputs:

1 0 0 58
2 2 21 53
3 10 46 17
4 47 36 1
5 35 44 2
6 23 48 11
6+ 0 1 46
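
One caveat with the loop above: find() returns None when a "results" block lacks one of the three values, and indexing None raises a TypeError. A rough sketch of a more tolerant variant follows. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, takes whatever "result" entries are present, and converts the counts to int. The `POLL` sample is the first two blocks from the question, and `poll_votes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# First two <results> blocks of the poll from the question.
POLL = """<poll title="User Suggested Number of Players" totalvotes="96" name="suggested_numplayers">
    <results numplayers="1">
        <result numvotes="0" value="Best"/>
        <result numvotes="0" value="Recommended"/>
        <result numvotes="58" value="Not Recommended"/>
    </results>
    <results numplayers="2">
        <result numvotes="2" value="Best"/>
        <result numvotes="21" value="Recommended"/>
        <result numvotes="53" value="Not Recommended"/>
    </results>
</poll>"""


def poll_votes(xml_text):
    """Map each numplayers value to a {vote label: vote count} dict."""
    root = ET.fromstring(xml_text)
    votes = {}
    for results in root.iter("results"):
        # Only the <result> children that exist are recorded, so a missing
        # label produces a shorter dict instead of an exception.
        votes[results.get("numplayers")] = {
            result.get("value"): int(result.get("numvotes", 0))
            for result in results.findall("result")
        }
    return votes


print(poll_votes(POLL)["2"]["Recommended"])
```

Keying by the numplayers string keeps values such as "6+" intact, which would not survive a conversion to int.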

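Or, if you want something more compact, find all of each attribute and zip them. Note that this only lines up correctly when every "results" element contains all three "result" entries: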

soup = BeautifulSoup(text,'xml')
numplayers = [tag['numplayers'] for tag in soup.find_all('results')]
best = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Best'})]
recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Recommended'})]
not_recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Not Recommended'})]
print(list(zip(numplayers, best, recommended, not_recommended)))

Outputs:

[('1', '0', '0', '58'), ('2', '2', '21', '53'), ('3', '10', '46', '17'), ('4', '47', '36', '1'), ('5', '35', '44', '2'), ('6', '23', '48', '11'), ('6+', '0', '1', '46')]
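
All of these values are strings, because XML attributes are text. As a small follow-up sketch, the tuples can be converted to integers before comparing them, for example to find the player count with the most "Best" votes. The `rows` list is the output shown above; `to_counts` is a hypothetical helper name.

```python
# The zipped output from above, copied verbatim.
rows = [('1', '0', '0', '58'), ('2', '2', '21', '53'), ('3', '10', '46', '17'),
        ('4', '47', '36', '1'), ('5', '35', '44', '2'), ('6', '23', '48', '11'),
        ('6+', '0', '1', '46')]


def to_counts(rows):
    """Keep numplayers as a string (it can be '6+') and cast the votes to int."""
    return [(players, int(best), int(rec), int(not_rec))
            for players, best, rec, not_rec in rows]


counts = to_counts(rows)
# Compare on the integer "Best" column; comparing the raw strings would sort
# '2' above '10'.
most_best = max(counts, key=lambda row: row[1])
print(most_best[0])  # prints 4
```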

Upvotes: 6
