Python, BeautifulSoup, re: How to convert extracted texts to dictionary from web?

Question

I wrote a script using BeautifulSoup to extract certain information from a web page. The only problem is that I don't know how to convert the results to a dictionary, and if I try, the code will turn into spaghetti. I am also not sure whether the code I wrote is Pythonic. The last item, Species, should be the binomial nomenclature, like "Lycaon pictus"; anything after "pictus" should be ignored. I need some assistance.

script

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=183833#null"
page = urlopen(Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
soup = BeautifulSoup(page, 'html.parser')
results = soup.findAll('tr')
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        print(res)

output from script

KingdomAnimalia
PhylumChordata
ClassMammalia Linnaeus, 1758
OrderCarnivora Bowdich, 1821
FamilyCanidae Fischer, 1817
GenusLycaon Brookes, 1827
SpeciesLycaon pictus (Temminck, 1820) – African hunting dog, African Wild Dog, Painted Hunting Dog

expected result

{
    'Kingdom': 'Animalia',
    'Phylum': 'Chordata',
    'Class': 'Mammalia',
    'Order': 'Carnivora',
    'Family': 'Canidae',
    'Genus': 'Lycaon',
    'Species': 'Lycaon pictus'
}

scrpy · Accepted Answer

For the specific example given, this works:

...
results = soup.findAll('tr')
my_dict = {}
for result in results:
    text = result.get_text().strip()
    pattern = r"^(Kingdom|Phylum|Division|Class|Order|Family|Genus|Species)[\w]+"
    if re.match(pattern, text):
        res = text.split('\n', 1)[0].strip()
        pieces = re.findall(r'[A-Z][ a-z]*', res)
        my_dict[pieces[0]] = pieces[1].strip()  # the character class includes a space, so drop any trailing one
print(my_dict)

Output:

{'Kingdom': 'Animalia', 'Phylum': 'Chordata', 'Class': 'Mammalia',
 'Order': 'Carnivora', 'Family': 'Canidae', 'Genus': 'Lycaon',
 'Species': 'Lycaon pictus'}

This relies heavily on the exact formatting shown in the example above. For example, if the website had 'Lycaon Pictus' with a capital 'P' for the 'Species', the corresponding entry in the dictionary would be just 'Lycaon' instead of 'Lycaon Pictus'.
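
If the capitalization dependence is a concern, one possible alternative is to split on whitespace instead of on capital letters: keep the first word for every rank, and the first two words for Species. The sketch below is only an illustration. The names `RANKS`, `RANK_RE` and `parse_taxonomy` are made up here, and it assumes each line starts with the rank name followed directly by the value, exactly as in the question's output (which is hard-coded as `sample` so the snippet runs without fetching the page). Like the original pattern, it would also pick up any other row whose text happens to begin with one of the rank names.

```python
import re

# Hypothetical helper names; the rank list is taken from the question's pattern.
RANKS = ("Kingdom", "Phylum", "Division", "Class",
         "Order", "Family", "Genus", "Species")
RANK_RE = re.compile(r"^(" + "|".join(RANKS) + r")\s*(\S.*)$")


def parse_taxonomy(lines):
    """Build {rank: name} from lines such as 'ClassMammalia Linnaeus, 1758'."""
    taxonomy = {}
    for line in lines:
        match = RANK_RE.match(line.strip())
        if not match:
            continue  # not a taxonomy row
        rank, value = match.groups()
        words = value.split()
        # A species name is a binomial (two words); other ranks are one word.
        keep = 2 if rank == "Species" else 1
        taxonomy[rank] = " ".join(words[:keep])
    return taxonomy


# The lines printed by the question's script, used here as sample input.
sample = [
    "KingdomAnimalia",
    "PhylumChordata",
    "ClassMammalia Linnaeus, 1758",
    "OrderCarnivora Bowdich, 1821",
    "FamilyCanidae Fischer, 1817",
    "GenusLycaon Brookes, 1827",
    "SpeciesLycaon pictus (Temminck, 1820) – African hunting dog, "
    "African Wild Dog, Painted Hunting Dog",
]

print(parse_taxonomy(sample))
```

On the sample lines this produces the dictionary from the question's expected result, and a line such as 'SpeciesLycaon Pictus ...' would still give 'Lycaon Pictus'. In the scraping loop, the same function could be fed the `res` strings instead of `sample`.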
