Extract part of Text in HTML using Beautiful Soup

Question

I have the following HTML:


    Division : First; Grand Total: 3861; Grand Max Total: 4600

I can extract the text "Division : First; Grand Total: 3861; Grand Max Total: 4600" by calling get_text() on the span element.

Is it possible to extract just the numbers (3861 and 4600) from that text, or to get only the digits by skipping the letters, using the Beautiful Soup library in Python?

Dom Weldon · Accepted Answer

Your data looks regular: it is a series of key-value pairs separated by semicolons. The function below extracts them into key-value tuples, converting each value to an integer where possible. You can then keep only the pairs whose value is numeric, as the last lines of the example do.

    def extract_kv_pairs(s):
        """Extract key-value pairs separated by colons and semicolons."""
        kvp = []
        for r in s.split(';'):
            # skip empty segments, e.g. after a trailing semicolon
            if not r.strip():
                continue
            # split on the first colon only, so a value may itself contain colons
            k, v = r.split(':', 1)
            try:
                # the value is an integer: convert it
                v = int(v)
            except ValueError:
                # not an integer: keep it as a trimmed string
                v = v.strip()

            kvp.append((k.strip(), v))

        return kvp

    s = 'Division : First; Grand Total: 3861; Grand Max Total: 4600'
    kvp = extract_kv_pairs(s)
    # keep only the pairs whose value was converted to an int
    numeric_values = [p for p in kvp if isinstance(p[1], int)]
    print(kvp)
    # [('Division', 'First'), ('Grand Total', 3861), ('Grand Max Total', 4600)]
    print(numeric_values)
    # [('Grand Total', 3861), ('Grand Max Total', 4600)]
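
If you only need the numbers and do not care about the labels, a regular expression is a shorter alternative. Beautiful Soup only gets you as far as the text; picking out the digits is plain Python either way. The following is a minimal sketch, assuming the values are non-negative integers. The helper name extract_numbers is hypothetical, and the string is hard-coded here where you would normally pass in the result of get_text().

```python
import re

def extract_numbers(text):
    """Return every run of digits in text as an int, in order of appearance."""
    return [int(m) for m in re.findall(r'\d+', text)]

# In practice this string would come from calling get_text() on the span.
text = 'Division : First; Grand Total: 3861; Grand Max Total: 4600'
print(extract_numbers(text))
# [3861, 4600]
```

Unlike the key-value approach, this would also pick up digits that appear inside a label or a non-numeric value, so it suits cases where you know the text contains no other numbers.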
