Reputation: 1162
I know there are several answers to questions about XML parsing with Python 3, but I can't find answers to two questions of my own. I am trying to parse and extract information from a BoardGameGeek XML file that looks like the following (it's too long to paste here):
https://www.boardgamegeek.com/xmlapi/boardgame/10
1) I am having trouble extracting the primary game name from these two lines:
<name sortindex="1" primary="true">Elfenland</name>
<name sortindex="1">Elfenland (Волшебное Путешествие)</name>
2) I am also having trouble extracting lists of data, such as in this XML:
<poll title="User Suggested Number of Players" totalvotes="96" name="suggested_numplayers">
<results numplayers="1">
<result numvotes="0" value="Best"/>
<result numvotes="0" value="Recommended"/>
<result numvotes="58" value="Not Recommended"/>
</results>
<results numplayers="2">
<result numvotes="2" value="Best"/>
<result numvotes="21" value="Recommended"/>
<result numvotes="53" value="Not Recommended"/>
</results>
<results numplayers="3">
<result numvotes="10" value="Best"/>
<result numvotes="46" value="Recommended"/>
<result numvotes="17" value="Not Recommended"/>
</results>
<results numplayers="4">
<result numvotes="47" value="Best"/>
<result numvotes="36" value="Recommended"/>
<result numvotes="1" value="Not Recommended"/>
</results>
<results numplayers="5">
<result numvotes="35" value="Best"/>
<result numvotes="44" value="Recommended"/>
<result numvotes="2" value="Not Recommended"/>
</results>
<results numplayers="6">
<result numvotes="23" value="Best"/>
<result numvotes="48" value="Recommended"/>
<result numvotes="11" value="Not Recommended"/>
</results>
<results numplayers="6+">
<result numvotes="0" value="Best"/>
<result numvotes="1" value="Recommended"/>
<result numvotes="46" value="Not Recommended"/>
</results>
</poll>
Currently, my code is very simple and looks like this. It only extracts simple, single-value XML elements. Any help on how to extract the more complex information would be great. Thank you.
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
yearpublished = soup.find_all('yearpublished')
Upvotes: 2
Views: 7394
Reputation: 9440
For the first part, search for the "name" element that has a "primary" attribute, like this:
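from bs4 import BeautifulSoup
import urllib.request  # a bare `import urllib` does not reliably load the `request` submodule

url = 'https://www.boardgamegeek.com/xmlapi/boardgame/10'
response = urllib.request.urlopen(url)
data = response.read()  # a `bytes` object
text = data.decode('utf-8')  # a `str`
soup = BeautifulSoup(text, 'xml')
# primary=True matches any <name> that has a "primary" attribute, whatever its value
name = soup.find('name', primary=True)
print(name.get_text())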
Outputs:
Elfenland
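If BeautifulSoup or its lxml backend is not available, the same lookup can be sketched with the standard library's xml.etree.ElementTree instead (a swapped-in technique, not what the code above uses). The `SAMPLE` string, its `<boardgames>`/`<boardgame>` wrapper elements, and the `primary_name` helper are assumptions made for illustration; only the two `<name>` lines come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down document; the wrapper elements are assumed.
SAMPLE = """<boardgames>
  <boardgame objectid="10">
    <name sortindex="1" primary="true">Elfenland</name>
    <name sortindex="1">Elfenland (Волшебное Путешествие)</name>
  </boardgame>
</boardgames>"""


def primary_name(xml_text):
    """Return the text of the first <name> whose primary attribute is "true"."""
    root = ET.fromstring(xml_text)
    # ElementTree's limited XPath supports [@attribute='value'] predicates.
    node = root.find(".//name[@primary='true']")
    return node.text if node is not None else None


print(primary_name(SAMPLE))  # prints: Elfenland
```

Unlike `primary=True` in BeautifulSoup, which only checks that the attribute exists, this predicate also requires the value to be exactly "true".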
For the second part, loop over the "results" elements and extract the data you want from each one:
text = """
<poll title="User Suggested Number of Players" totalvotes="96" name="suggested_numplayers">
<results numplayers="1">
<result numvotes="0" value="Best"/>
...
<result numvotes="46" value="Not Recommended"/>
</results>
</poll>
"""
soup = BeautifulSoup(text,'xml')
for result in soup.find_all('results'):
    numplayers = result['numplayers']
    best = result.find('result', {'value': 'Best'})['numvotes']
    recommended = result.find('result', {'value': 'Recommended'})['numvotes']
    not_recommended = result.find('result', {'value': 'Not Recommended'})['numvotes']
    print(numplayers, best, recommended, not_recommended)
Outputs:
1 0 0 58
2 2 21 53
3 10 46 17
4 47 36 1
5 35 44 2
6 23 48 11
6+ 0 1 46
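The vote counts above come back as strings. If you want numbers keyed by player count, one possible sketch is to build a nested dictionary per "results" block. This version uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it has no third-party dependencies; the `POLL` sample (two blocks copied from the question) and the `poll_votes` helper are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Two <results> blocks copied from the question's poll; the rest is omitted.
POLL = """<poll title="User Suggested Number of Players" totalvotes="96" name="suggested_numplayers">
  <results numplayers="1">
    <result numvotes="0" value="Best"/>
    <result numvotes="0" value="Recommended"/>
    <result numvotes="58" value="Not Recommended"/>
  </results>
  <results numplayers="4">
    <result numvotes="47" value="Best"/>
    <result numvotes="36" value="Recommended"/>
    <result numvotes="1" value="Not Recommended"/>
  </results>
</poll>"""


def poll_votes(xml_text):
    """Map each numplayers value to a {vote label: vote count} dictionary."""
    root = ET.fromstring(xml_text)
    votes = {}
    for results in root.iter("results"):
        # Keep numplayers as a string because values such as "6+" are not integers.
        votes[results.get("numplayers")] = {
            result.get("value"): int(result.get("numvotes"))
            for result in results.findall("result")
        }
    return votes


votes = poll_votes(POLL)
print(votes["4"]["Best"])  # prints: 47
```

Because each inner dictionary is built from its own "results" block, a missing "result" entry only affects that block and cannot shift values between player counts.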
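Or, if you want something more compact, collect each attribute into its own list and zip the lists together. This relies on every "results" block containing all three "result" entries: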
soup = BeautifulSoup(text,'xml')
numplayers = [tag['numplayers'] for tag in soup.find_all('results')]
best = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Best'})]
recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Recommended'})]
not_recommended = [tag['numvotes'] for tag in soup.find_all('result', {'value': 'Not Recommended'})]
print(list(zip(numplayers, best, recommended, not_recommended)))
Outputs:
[('1', '0', '0', '58'), ('2', '2', '21', '53'), ('3', '10', '46', '17'), ('4', '47', '36', '1'), ('5', '35', '44', '2'), ('6', '23', '48', '11'), ('6+', '0', '1', '46')]
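One caveat with the zip approach: `zip` silently stops at the shortest list, so a "results" block that lacks one of the three entries would go unnoticed. A minimal sketch of a length check before zipping follows; `zip_checked` is a hypothetical helper name, and the sample lists are the first three rows of the output above. It is only a cheap sanity check, since equal lengths do not prove that every row is correctly aligned.

```python
def zip_checked(*columns):
    """Zip the columns only if they all have the same length."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


# First three rows from the output above, with the last column left out for brevity.
numplayers = ['1', '2', '3']
best = ['0', '2', '10']
recommended = ['0', '21', '46']

rows = zip_checked(numplayers, best, recommended)
print(rows)
```

On Python 3.10 and later, `zip(..., strict=True)` performs a similar check without a helper.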
Upvotes: 6