Reputation: 3106
I want to clean a JSON file of incorrectly extracted HTML content by throwing away all the text which is enclosed in HTML tags, including the tags themselves.
I tried this function:
import re

def stripIt(s):
    txt = re.sub(r'</?[^<]+?>.*?</[^<]+?>', '', s)
    return re.sub(r'\s+', ' ', txt)
but when I applied it to the JSON file, it apparently broke the JSON, and I got errors afterwards.
The HTML content is also broken with missing tags, only closing tags, and so on.
So how can I strip all the HTML content from a JSON file, without breaking the file?
Upvotes: 1
Views: 5371
Reputation: 697
I am assuming here that you are trying to remove the HTML from the JSON object values.
Load the JSON file, extract the object value you want to clean, and convert it to a string, which prevents any error due to Unicode character conversion:
import json
import re

with open('File_Name', encoding="utf8") as jsonFile:
    data = json.load(jsonFile)

# JSON_Object_Value is a placeholder for the value you extracted from data
string = str(JSON_Object_Value)
For stripping out the HTML tags from the string value of the JSON object and replacing them with a space character (" "):
clean = re.compile('<.*?>')
string = re.sub(clean, " ", string)
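For stripping out character entities (named ones like &amp; as well as numeric ones like &#38; or &#x26;) from the string value of the JSON object and replacing them with a space character (" "). Note that this pattern also matches any other text between an ampersand and the next semicolon: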
clean = re.compile('&.*?;')
string = re.sub(clean, " ", string)
Instead of the space character, you can replace them with any other desired character too.
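The two substitutions above can be combined into one helper that takes the replacement as a parameter. This is only a minimal sketch: strip_markup and the sample string are made-up names for illustration, and the entity pattern is an assumption that is deliberately tighter than '&.*?;' so that a lone ampersand in ordinary text is left alone.

```python
import re

TAG_RE = re.compile(r'<.*?>')        # same tag pattern as above
ENTITY_RE = re.compile(r'&#?\w+;')   # assumption: narrower than '&.*?;'

def strip_markup(text, replacement=" "):
    """Replace HTML tags and character entities in text with replacement."""
    text = TAG_RE.sub(replacement, text)
    return ENTITY_RE.sub(replacement, text)

# Hypothetical sample value; pass "" to drop the markup instead of padding it
print(strip_markup("Tom &amp; Jerry<br/>say hi", ""))
```

If you would rather keep the characters the entities stand for instead of discarding them, html.unescape() from the standard library decodes them.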
Upvotes: 0
Reputation: 338336
How do I strip the html content out from a json file without breaking it?
The same way as with any other serialized data structure. By using a proper parser (and, in this case, a tiny recursive function).
import json
import re

json_string = """{
  "prop_1": {
    "prop_1_1": ["some <html> data", 17, "more <html> data"],
    "prop_1_2": "here some <html>, too"
  },
  "prop_2": "and more <html>"
}"""

def unhtml(string):
    # replace <tag>...</tag>, possibly more than once
    done = False
    while not done:
        temp = re.sub(r'<([^/]\S*)[^>]*>[\s\S]*?</\1>', '', string)
        done = temp == string
        string = temp
    # replace remaining standalone tags, if any
    string = re.sub(r'<[^>]*>', '', string)
    string = re.sub(r'\s{2,}', ' ', string)
    return string.strip()

def cleanup(element):
    # walk lists and dicts recursively; clean every string found on the way
    if isinstance(element, list):
        for i, item in enumerate(element):
            element[i] = cleanup(item)
    elif isinstance(element, dict):
        for key in element.keys():
            element[key] = cleanup(element[key])
    elif isinstance(element, str):  # Python 3; this was basestring in Python 2
        element = unhtml(element)
    return element

used as

data = json.loads(json_string)
cleanup(data)
json_string = json.dumps(data)
print(json_string)
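Since the question is about a file rather than a string, the same idea could be wrapped roughly like this in Python 3. It is a sketch under assumptions, not part of the code above: cleaned(), clean_json_file() and strip_tags() are hypothetical names, strip_tags() is only a simplified stand-in for unhtml(), and this variant returns a cleaned copy instead of modifying the data in place.

```python
import json
import re

def strip_tags(text):
    # Simplified stand-in for unhtml(): drops anything that looks like a tag
    text = re.sub(r'<[^>]*>', '', text)
    return re.sub(r'\s{2,}', ' ', text).strip()

def cleaned(element, clean_text=strip_tags):
    """Return a cleaned copy of a decoded JSON value."""
    if isinstance(element, list):
        return [cleaned(item, clean_text) for item in element]
    if isinstance(element, dict):
        return {key: cleaned(value, clean_text) for key, value in element.items()}
    if isinstance(element, str):
        return clean_text(element)
    return element  # numbers, booleans and None pass through unchanged

def clean_json_file(src_path, dst_path):
    # Parse first, clean the decoded strings, then serialize again,
    # so the JSON syntax itself is never touched by a regex
    with open(src_path, encoding='utf8') as src:
        data = json.load(src)
    with open(dst_path, 'w', encoding='utf8') as dst:
        json.dump(cleaned(data), dst, ensure_ascii=False)
```

Because the regex only ever sees decoded string values, it cannot remove quotes, brackets or commas that belong to the JSON structure.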
The regex to throw out the HTML tags only solves half the problem. All character entities (like &amp; or &lt;) will remain in the string.
Rewrite unhtml() to use a proper parser, too.
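One possible way to do that, sketched with html.parser from the Python 3 standard library. The class name _TextExtractor is made up for illustration. Note the behavioural difference: this version drops the tags and decodes the entities, but keeps the text between an opening and a closing tag, whereas the regex version removes that text as well. Reliably discarding enclosed text is harder when tags are missing or unbalanced, so it is left out here.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag it sees."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text reaches handle_data()
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def unhtml(string):
    parser = _TextExtractor()
    parser.feed(string)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # collapse the whitespace left behind where tags were removed
    return re.sub(r"\s+", " ", text).strip()
```

The parser tolerates stray closing tags and unclosed opening tags, so it can be dropped into cleanup() in place of the regex-based unhtml().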
Upvotes: 5