How to extract and ignore span in markup? - python

Question

How to extract and ignore span in HTML markup?

My input looks like this:


<li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a>sale</a> of the same <u>product</u>, as when a restaurant <a>chain</a> takes over a <a>wine</a> importer</li>

Desired outputs:

label = 'noun' # String embedded between <span>...</span>
meaning = 'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer' # the text without the string embedded within <span>...</span>
related_to = ['sale', 'chain', 'wine'] # String embedded between <a>...</a>
utag = ['product'] # String embedded between <u>...</u>

I've tried this:

>>> from bs4 import BeautifulSoup
>>> text = '''
... <li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a>sale</a> of the same <u>product</u>, as when a restaurant <a>chain</a> takes over a <a>wine</a> importer</li>'''
>>> bsoup = BeautifulSoup(text, 'html.parser')
>>> bsoup.text
u'\nnoun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'

# Getting the `label`
>>> label = bsoup.find('span')
>>> label
<span>noun</span>
>>> label = bsoup.find('span').text
>>> label
u'noun'

# Getting the text.
>>> bsoup.text.strip()
u'noun the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'
>>> definition = bsoup.text.strip()
>>> definition = definition.partition(' ')[2] if definition.split()[0] == label else definition
>>> definition
u'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer'

# Getting the related_to and utag
>>> related_to = [r.text for r in bsoup.find_all('a')]
>>> related_to
[u'sale', u'chain', u'wine']
>>> utag = [r.text for r in bsoup.find_all('u')]
>>> utag
[u'product']

Using BeautifulSoup is okay but it's a little verbose to get what's needed.

Is there any other way to achieve the same outputs?

Is there a regex way with some groups to catch the desired outputs?

alecxe · Accepted Answer

The markup has a pretty well-formed structure and you've stated the set of rules clearly, so I would still approach it with BeautifulSoup and apply the "Extract Method" refactoring, moving the extraction logic into a single function:

from pprint import pprint
from bs4 import BeautifulSoup


data = """
<li><span>noun</span> the joining together of businesses which deal with different stages in the production or <a>sale</a> of the same <u>product</u>, as when a restaurant <a>chain</a> takes over a <a>wine</a> importer</li>
"""

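def get_info(elm):
    # The first <span> inside the element holds the label (e.g. "noun").
    label = elm.find("span")
    return {
        "label": label.text,
        # Join everything after the label: tags contribute their .text,
        # plain strings fall through getattr() unchanged.
        "meaning": "".join(getattr(sibling, "text", sibling) for sibling in label.next_siblings).strip(),
        "related_to": [a.text for a in elm.find_all("a")],
        "utag": [u.text for u in elm.find_all("u")]
    }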

soup = BeautifulSoup(data, "html.parser")
pprint(get_info(soup.li))

Prints:

{'label': u'noun',
 'meaning': u'the joining together of businesses which deal with different stages in the production or sale of the same product, as when a restaurant chain takes over a wine importer',
 'related_to': [u'sale', u'chain', u'wine'],
 'utag': [u'product']}
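
As for the regex part of the question, here is a rough sketch using only the standard `re` module. It assumes the input is a single `<li>` with one leading `<span>` for the label and flat, non-nested `<a>` and `<u>` tags; the `get_info_regex` helper and the attribute-free sample markup are my own illustration, not something from the question. Regexes break easily on nested or malformed HTML, so I would treat this as a shortcut for tightly controlled input rather than a replacement for a parser.

```python
import re

# Assumed minimal markup: only the tag names are known from the question;
# any attributes the real page may carry are omitted here.
html = (
    "<li><span>noun</span> the joining together of businesses which deal with "
    "different stages in the production or <a>sale</a> of the same <u>product</u>, "
    "as when a restaurant <a>chain</a> takes over a <a>wine</a> importer</li>"
)

TAG = re.compile(r"<[^>]+>")  # matches any opening or closing tag


def get_info_regex(fragment):
    """Hypothetical regex-only helper; not robust to nested or broken HTML."""
    label_match = re.search(r"<span\b[^>]*>(.*?)</span>", fragment, re.S)
    label = label_match.group(1).strip() if label_match else None
    # Everything after the closing </span> is the definition text.
    rest = fragment[label_match.end():] if label_match else fragment
    return {
        "label": label,
        # Drop the remaining tags but keep the text they wrap.
        "meaning": TAG.sub("", rest).strip(),
        "related_to": re.findall(r"<a\b[^>]*>(.*?)</a>", fragment, re.S),
        "utag": re.findall(r"<u\b[^>]*>(.*?)</u>", fragment, re.S),
    }


info = get_info_regex(html)
print(info)
```

The `\b` after each tag name keeps `<u>` from also matching tags such as `<ul>`, and `[^>]*` tolerates attributes if the real markup has them.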
