How can I strip all the HTML content from a JSON file?

Question

I want to clean a JSON file of incorrectly extracted HTML content by throwing away all the text which is enclosed in HTML tags, including the tags themselves.

I tried this function:

def stripIt(s):
    txt = re.sub(r'<[^<]+?>.*?</[^<]+?>', '', s)
    return re.sub(r'\s+', ' ', txt)

but when I applied it to the whole JSON file, it apparently broke the JSON itself, because parsing the result gave errors.

The HTML content is also broken with missing tags, only closing tags, and so on.

So how can I strip all the HTML content from a JSON file, without breaking the file?

Tomalak · Accepted Answer

How do I strip the html content out from a json file without breaking it?

The same way as with any other serialized data structure. By using a proper parser (and, in this case, a tiny recursive function).

import json
import re

json_string = """{
  "prop_1": {
    "prop_1_1": ["some <b>HTML</b> data", 17, "more <b>HTML</b> data"],
    "prop_1_2": "here some <b>HTML</b>, too"
  },
  "prop_2": "and more <b>HTML</b>"
}"""

def unhtml(string):
    # remove <tag>...</tag> pairs; repeat until nothing changes, so nested pairs go too
    done = False
    while not done:
        temp = re.sub(r'<([^/]\S*)[^>]*>[\s\S]*?</\1>', '', string)
        done = temp == string
        string = temp
    # replace remaining standalone tags, if any
    string = re.sub(r'<[^>]*>', '', string)
    string = re.sub(r'\s{2,}', ' ', string)
    return string.strip()

def cleanup(element):
    if isinstance(element, list):
        for i, item in enumerate(element):
            element[i] = cleanup(item)
    elif isinstance(element, dict):
        for key in element.keys():
            element[key] = cleanup(element[key])
    elif isinstance(element, str):  # on Python 2, test against basestring instead
        element = unhtml(element)

    return element

used as

data = json.loads(json_string)
cleanup(data)  # modifies lists and dicts in place
json_string = json.dumps(data)
print(json_string)
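
Since the question is about a JSON file rather than a string, here is a minimal sketch of the same idea applied to files. It is not part of the original answer: the names strip_tags, map_strings and clean_json_file are made up for illustration, strip_tags is only a stand-in for whatever cleaner you end up using, and it assumes Python 3 and UTF-8 encoded files. Unlike cleanup() above, it builds a new structure instead of modifying the decoded one in place.

```python
import json
import re
import tempfile
from pathlib import Path


def strip_tags(text):
    # Stand-in cleaner for this sketch: drops anything that looks like a tag.
    return re.sub(r"<[^>]*>", "", text)


def map_strings(node, fn):
    """Return a copy of a decoded JSON value with fn applied to every string value."""
    if isinstance(node, dict):
        return {key: map_strings(value, fn) for key, value in node.items()}
    if isinstance(node, list):
        return [map_strings(item, fn) for item in node]
    if isinstance(node, str):
        return fn(node)
    return node  # numbers, booleans and None pass through unchanged


def clean_json_file(src, dst, fn):
    """Decode src, clean every string value with fn, and write valid JSON to dst."""
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    cleaned = map_strings(data, fn)
    Path(dst).write_text(
        json.dumps(cleaned, ensure_ascii=False, indent=2), encoding="utf-8"
    )


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs without any real input.
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp) / "in.json", Path(tmp) / "out.json"
        src.write_text('{"a": ["x <b>y</b>", 17]}', encoding="utf-8")
        clean_json_file(src, dst, strip_tags)
        print(dst.read_text(encoding="utf-8"))
```

Because the text is only ever touched after json.loads() and before json.dumps(), quotes, braces and escapes in the output are always produced by the JSON encoder, so the file cannot come out malformed. Note that only values are cleaned; object keys are left alone.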

The regex that throws out the HTML tags solves only half the problem. All character entities (like &amp; or &lt;) will remain in the string.

Rewrite unhtml() to use a proper parser, too.
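
One way that rewrite could look, as a minimal sketch rather than the answer's own code. It uses html.parser.HTMLParser from the Python 3 standard library; the names _TextOutsideElements and unhtml_parsed are made up for illustration. The assumptions are that text inside a properly closed element should be dropped, that text following a tag which is never closed should be kept, and that stray closing tags should simply be ignored, which roughly mirrors what the regex version above does. Character entities are decoded by the parser.

```python
import re
from html.parser import HTMLParser


class _TextOutsideElements(HTMLParser):
    """Collects the text that is not enclosed in a properly closed element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        # Each frame is [tag, text_parts]; frame 0 holds the top-level text.
        self.frames = [[None, []]]

    def handle_starttag(self, tag, attrs):
        # Open a new frame; its text is only provisional until we see the end tag.
        self.frames.append([tag, []])

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> enclose nothing.
        pass

    def handle_endtag(self, tag):
        # Find the innermost open frame with this tag and discard it, together
        # with everything nested inside it. Frame 0 is never discarded.
        for i in range(len(self.frames) - 1, 0, -1):
            if self.frames[i][0] == tag:
                del self.frames[i:]
                return
        # No matching opening tag: a stray closing tag, ignored.

    def handle_data(self, data):
        self.frames[-1][1].append(data)

    def text(self):
        # Frames still open at the end were never closed, so their text is kept.
        return "".join(part for _, parts in self.frames for part in parts)


def unhtml_parsed(string):
    parser = _TextOutsideElements()
    parser.feed(string)
    parser.close()  # flushes any text the parser is still buffering
    return re.sub(r"\s{2,}", " ", parser.text()).strip()
```

With this, unhtml_parsed("Fish &amp; chips <i>x</i>") gives "Fish & chips", and broken input such as "only closing</p> tag" comes back as "only closing tag". If you would rather throw away everything after an unclosed tag, that is a policy decision you can make in text().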
