Ninja Warrior 11

Reputation: 372

Python, BeautifulSoup, re: How to convert extracted texts to dictionary from web?

I wrote a script using BeautifulSoup to extract certain information from a web page. The only problem is that I don't know how to convert the results to a dictionary, and if I try, the code turns into spaghetti. I am also not sure whether the code I wrote counts as Pythonic. The last item, Species, should be the binomial nomenclature, e.g. "Lycaon pictus"; everything after "pictus" should be ignored. I need some assistance.

script

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183833#null"
page = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
soup = BeautifulSoup(page, 'html.parser')
results = soup.findAll('tr')
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        print(res)

output from script

KingdomAnimalia
PhylumChordata
ClassMammalia Linnaeus, 1758
OrderCarnivora Bowdich, 1821
FamilyCanidae Fischer, 1817
GenusLycaon Brookes, 1827
SpeciesLycaon pictus (Temminck, 1820) – African hunting dog, African Wild Dog, Painted Hunting Dog

expected result

{
    'Kingdom': 'Animalia',
    'Phylum': 'Chordata',
    'Class': 'Mammalia',
    'Order': 'Carnivora',
    'Family': 'Canidae',
    'Genus': 'Lycaon',
    'Species': 'Lycaon pictus'
}

Upvotes: 1

Views: 1403

Answers (2)

scrpy

Reputation: 1021

For the specific example given, this works:

...
results = soup.findAll('tr')
my_dict = {}
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        pieces = re.findall(r'[A-Z][ a-z]*', res)
        my_dict[pieces[0]] = pieces[1]
print(my_dict)

Output:

{'Kingdom': 'Animalia', 'Phylum': 'Chordata', 'Class': 'Mammalia',
 'Order': 'Carnivora', 'Family': 'Canidae', 'Genus': 'Lycaon',
 'Species': 'Lycaon pictus'}

This relies heavily on the exact formatting shown in the example above. For example, if the website had 'Lycaon Pictus' with a capital 'P' for the 'Species', the corresponding dictionary entry would be just 'Lycaon' instead of 'Lycaon Pictus'.
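One way to reduce the dependence on capitalization is to strip the known rank prefix and then keep a fixed number of whitespace-separated words: two for Species, one for every other rank. The following is only a minimal sketch, run against the output lines from the question rather than the live page; `RANKS`, `LINE_RE`, and `parse_line` are names made up for illustration, and it assumes the name is separated from the author citation by whitespace.

```python
import re

RANKS = ("Kingdom", "Phylum", "Division", "Class",
         "Order", "Family", "Genus", "Species")
# Rank name at the start of the line, followed by the rest of the line.
LINE_RE = re.compile(r"^(%s)(.+)$" % "|".join(RANKS))

def parse_line(line):
    """Return (rank, name) for a taxonomy line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rank, rest = m.groups()
    # A species name is a binomial (two words); other ranks use one word.
    n_words = 2 if rank == "Species" else 1
    return rank, " ".join(rest.split()[:n_words])

# Lines copied from the "output from script" section of the question.
lines = [
    "KingdomAnimalia",
    "PhylumChordata",
    "ClassMammalia Linnaeus, 1758",
    "OrderCarnivora Bowdich, 1821",
    "FamilyCanidae Fischer, 1817",
    "GenusLycaon Brookes, 1827",
    "SpeciesLycaon pictus (Temminck, 1820) – African hunting dog, "
    "African Wild Dog, Painted Hunting Dog",
]

taxonomy = {}
for line in lines:
    parsed = parse_line(line)
    if parsed is not None:
        rank, name = parsed
        taxonomy[rank] = name
print(taxonomy)
```

Because the word count is tied to the rank rather than to letter case, a line such as 'SpeciesLycaon Pictus' would still give 'Lycaon Pictus'.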

Upvotes: 1

Phairero

Reputation: 76

"result" here is something like

<td align="left" class="body" width="2%"> </td>
<td align="left" class="body" valign="top" width="24%">Kingdom</td>
<td class="datafield" valign="top" width="71%"><a href="SingleRpt?search_topic=TSN&amp;search_value=202423">Animalia</a> 
 – Animal, animaux, animals</td>
<td class="body" width="5%"> </td>

When you call .get_text() on it, it turns into

'\xa0KingdomAnimalia\xa0\n – Animal, animaux, animals\n\xa0'

So once a row matches, you should go back to the original 'result' element and split it into its columns. For example:

if re.match(pattern, text):
    pieces = result.findAll('td')

and then use those pieces to find your information, for example

for p in pieces:
    print(p.get_text())

Of course, you cannot expect to get a dictionary back when you are working with strings and never build a mapping in the first place. So create one before the for loop starts (let's call it dictionary) and fill it in as you go:

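dictionary = {}  # create once, before the loop
for result in results:
    text = result.get_text().strip()
    if re.match(pattern, text):
        p = result.findAll('td')
        rank = p[1].get_text().strip()            # second column: rank name
        taxon = p[2].get_text().split('\xa0')[0]  # third column, up to the first non-breaking space
        dictionary[rank] = taxon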

This would get you the dictionary you are looking for.
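To try the column-splitting step on its own, without fetching the page, here is a minimal sketch that works on the cell texts directly. The `cells` list stands in for `[td.get_text() for td in result.findAll('td')]` on the Kingdom row shown above and is reconstructed by hand from that HTML; `entry_from_cells` is a hypothetical helper name, and it assumes the taxon name is always followed by a non-breaking space, as it is in that row.

```python
def entry_from_cells(cells):
    """Return (rank, taxon) from the cell texts of one table row."""
    rank = cells[1].strip()
    # The taxon name ends at the first non-breaking space (\xa0).
    taxon = cells[2].split('\xa0')[0].strip()
    return rank, taxon

# Cell texts for the Kingdom row, reconstructed from the HTML above.
cells = ['\xa0', 'Kingdom', 'Animalia\xa0\n – Animal, animaux, animals', '\xa0']

dictionary = {}
rank, taxon = entry_from_cells(cells)
dictionary[rank] = taxon
print(dictionary)
```

Rows whose third column is laid out differently would need their own check, so it is worth printing the raw cell texts for a few rows before relying on the split.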

Upvotes: 1
