stackit

Reputation: 3106

How can I strip all the HTML content from a JSON file?

I want to clean a JSON file of incorrectly extracted HTML content by throwing away all the text which is enclosed in HTML tags, including the tags themselves.

I tried this function:

def stripIt(s):
    txt = re.sub('</?[^<]+?>.*?</[^<]+?>', '', s)
    return re.sub('\s+', ' ', txt)

but when I apply it to the JSON file, it seems to break the JSON, and I get errors.

The HTML content is also broken with missing tags, only closing tags, and so on.

So how can I strip all the HTML content from a JSON file, without breaking the file?

Upvotes: 1

Views: 5371

Answers (2)

Srijan Chaudhary

Reputation: 697

I am assuming here that you are trying to remove the HTML from the JSON object values.

Load the JSON file, extract the value you want to clean, and convert it to a string, which avoids errors from Unicode character conversion:

import json
import re

with open('File_Name', encoding="utf8") as jsonFile:
    data = json.load(jsonFile)

# JSON_Object_Value is a placeholder for the value you pull out of data
string = str(JSON_Object_Value)

For stripping out the HTML tags from the string value of the JSON object and replacing them with a space character (" "):

clean = re.compile('<.*?>')
string = re.sub(clean, " ", string)

For stripping out HTML character entities (named ones such as &amp;, or numeric ones such as &#39; and &#x27;) from the string value of the JSON object and replacing them with a space character (" "):

clean = re.compile('&.*?;')
string = re.sub(clean, " ", string)

Instead of the space character, you can replace them with any other desired character too.
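One caveat with '&.*?;' is that it also matches ordinary text that happens to sit between an ampersand and a later semicolon. A minimal sketch of a narrower pattern follows; the names ENTITY and strip_entities are made up for illustration, and the pattern assumes entities always look like &name;, &#123; or &#x1F;:

```python
import re

# Assumption: an entity is "&name;", "&#123;" or "&#x1F;" with no spaces inside.
ENTITY = re.compile(r'&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

def strip_entities(string, replacement=" "):
    """Replace entity-shaped sequences only; a bare '&' in prose is left alone."""
    return ENTITY.sub(replacement, string)
```

With this pattern, a value like "AT&T rocks; fish &amp; chips" keeps "AT&T rocks;" intact, whereas '&.*?;' would remove "&T rocks;" as well.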

Upvotes: 0

Tomalak

Reputation: 338336

How do I strip the html content out from a json file without breaking it?

The same way as with any other serialized data structure: by using a proper parser (and, in this case, a tiny recursive function).

import json
import re

json_string = """{
  "prop_1": {
    "prop_1_1": ["some <html> data", 17, "more <html> data"],
    "prop_1_2": "here some <html>, too"
  },
  "prop_2": "and more <html>"
}"""

def unhtml(string):
    # replace <tag>...</tag>, possibly more than once
    done = False
    while not done:
        temp = re.sub(r'<([^/]\S*)[^>]*>[\s\S]*?</\1>', '', string)
        done = temp == string
        string = temp
    # replace remaining standalone tags, if any
    string = re.sub(r'<[^>]*>', '', string)
    string = re.sub(r'\s{2,}', ' ', string)
    return string.strip()

def cleanup(element):
    if isinstance(element, list):
        for i, item in enumerate(element):
            element[i] = cleanup(item)
    elif isinstance(element, dict):
        for key in element.keys():
            element[key] = cleanup(element[key])
    elif isinstance(element, str):  # Python 3; use basestring on Python 2
        element = unhtml(element)

    return element

used as

data = json.loads(json_string)
cleanup(data)
json_string = json.dumps(data)
print(json_string)

The regex to throw out the HTML tags only solves half the problem. All character entities (like &amp; or &lt;) will remain in the string.
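If the goal is to decode those entities rather than drop them, the standard library's html.unescape (Python 3.4+) can do it. This is only a sketch, and strip_tags_decode_entities is a made-up name:

```python
import html
import re

def strip_tags_decode_entities(string):
    # Remove tags first, so that "&lt;b&gt;" is not decoded into "<b>"
    # and then mistaken for a real tag.
    without_tags = re.sub(r'<[^>]*>', '', string)
    # Turn "&amp;" into "&", "&lt;" into "<", "&#39;" into "'", and so on.
    return html.unescape(without_tags)
```

The order of the two steps matters: decoding first would turn escaped text such as "&lt;b&gt;" into something that looks like markup.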

Rewrite unhtml() to use a proper parser, too.
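As a rough sketch of what that could look like with the standard library's html.parser (Python 3): the class and function names below are made up, and the handling of broken markup is an assumption, namely that text inside a properly closed element is dropped, while unclosed or stray tags are removed without their surrounding text.

```python
import re
from html.parser import HTMLParser


class _TextOutsideTags(HTMLParser):
    """Collects text that is not enclosed in a properly closed element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        # Each frame is [tag_name, text_chunks]; frame 0 holds top-level text.
        self.frames = [[None, []]]

    def handle_starttag(self, tag, attrs):
        self.frames.append([tag, []])

    def handle_endtag(self, tag):
        # Find the innermost open element with this name and drop it,
        # together with everything opened inside it.
        for i in range(len(self.frames) - 1, 0, -1):
            if self.frames[i][0] == tag:
                del self.frames[i:]
                return
        # A closing tag without an opening tag is simply ignored.

    def handle_data(self, data):
        self.frames[-1][1].append(data)

    def text(self):
        # Frames still open at the end were never closed: keep their text.
        return ''.join(chunk for _, chunks in self.frames for chunk in chunks)


def unhtml_parsed(string):
    parser = _TextOutsideTags()
    parser.feed(string)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r'\s+', ' ', parser.text()).strip()
```

It can be dropped into cleanup() in place of unhtml(). Whether unclosed tags should keep or lose their trailing text is a judgment call that depends on the data.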

Upvotes: 5
