Reputation: 238
I need to convert HTML into a list of Python dictionaries. Sample HTML:
<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>
Expected Result:
[
    {
        "q": "Question 1",
        "a": ["Some text here too", "And some text here"]
    },
    {
        "q": "Question 2",
        "a": ["Answer Here", "<ul><li>text1</li><li>text2</li>"]
    },
    {
        "q": "Question 3",
        "a": ["Answer Here", "<table>...</table>"]
    },
    {
        "q": "Question 4",
        "a": ["Answer"]
    }
]
Any assistance will be highly appreciated. Thanks in advance :)
Upvotes: 0
Views: 2762
Reputation: 3096
Copied from @Sayeed Hossain's exceptionally clear explanatory answer, which uses the Python HTML parser module (html.parser; see its docs):
from html.parser import HTMLParser
html_chunk =''' <div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>'''
class html_to_dict_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_word_label = False   # True while inside an <h4> (question)
        self.in_definition = False   # True while inside a <p> (answer)
        self.has_definition = False  # True while a question awaits its answer

    def handle_starttag(self, tag, attrs):
        if tag == 'h4':
            self.in_word_label = True
        elif tag == 'p':
            self.in_definition = True

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_word_label = False
        elif tag == 'p':
            self.in_definition = False

    def handle_data(self, data):
        if self.in_word_label:
            self.latest_word = data.lower()
            self.has_definition = True
        elif self.in_definition and self.has_definition:
            # Pair the pending question with the first <p> that follows it
            dictionary = dict()
            dictionary["q"] = self.latest_word
            dictionary["a"] = data
            list_of_dictionary.append(dictionary)
            self.has_definition = False

# create empty list
list_of_dictionary = []
# Run the parser!
parser = html_to_dict_parser()
parser.feed(html_chunk)
parser.close()
print(list_of_dictionary)
Output (only the first <p> after each <h4> is captured, and the question text is lowercased):
[{'q': 'question 1', 'a': 'Some text here too'}, {'q': 'question 2', 'a': 'Answer Here'}, {'q': 'question 3', 'a': 'Answer Here'}, {'q': 'question 4', 'a': 'Answer'}]
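The output above differs from the expected result in the question: each "a" is a single string rather than a list, and the <ul> and <table> fragments are dropped. A minimal sketch of one way to close that gap with the same html.parser module follows. The class name QAParser and its attributes are hypothetical, not part of the original answer, and it assumes the answers are flat siblings of the <h4> headings (no <p> nested inside a list or table). Non-<p> fragments are kept as raw markup using HTMLParser.get_starttag_text().

```python
from html.parser import HTMLParser

sample_html = '''<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>'''


class QAParser(HTMLParser):
    """Hypothetical helper: group what follows each <h4> into one entry."""

    def __init__(self):
        super().__init__()
        self.results = []    # finished {"q": ..., "a": [...]} entries
        self.mode = None     # "q" inside <h4>, "p" inside <p>, else None
        self.text = []       # text pieces of the current <h4> or <p>
        self.raw = []        # raw markup of a pending non-<p> fragment
        self.open_tags = []  # tags opened inside that raw fragment

    def flush_raw(self):
        # Store the pending raw fragment, if any, as one answer item.
        fragment = "".join(self.raw).strip()
        if fragment and self.results:
            self.results[-1]["a"].append(fragment)
        self.raw = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "p"):
            # A new question or paragraph ends any raw fragment before it.
            self.flush_raw()
            self.mode = "q" if tag == "h4" else "p"
            self.text = []
        elif self.results and self.mode is None:
            # Any other tag after a question is kept as raw markup.
            self.raw.append(self.get_starttag_text())
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "h4" and self.mode == "q":
            self.results.append({"q": "".join(self.text).strip(), "a": []})
            self.mode = None
        elif tag == "p" and self.mode == "p":
            if self.results:
                self.results[-1]["a"].append("".join(self.text).strip())
            self.mode = None
        elif tag in self.open_tags:
            # Only close tags that were opened inside the raw fragment,
            # so a trailing </div> from the container is ignored.
            self.raw.append(f"</{tag}>")
            self.open_tags.remove(tag)

    def handle_data(self, data):
        if self.mode in ("q", "p"):
            self.text.append(data)
        elif self.results:
            self.raw.append(data)

    def close(self):
        super().close()
        self.flush_raw()  # keep a fragment that ends the document


qa_parser = QAParser()
qa_parser.feed(sample_html)
qa_parser.close()
print(qa_parser.results)
```

For the sample HTML this yields one dictionary per <h4>, with "a" holding every paragraph plus the unclosed <ul> and the <table> as raw strings. Inline markup inside a <p> (for example <b>) is reduced to its text, which may or may not be what you want.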
Upvotes: 1
Reputation: 9857
Give this a try. It's not perfect: there's a problem with 'navigable strings' that I couldn't resolve satisfactorily.
from bs4 import Tag, NavigableString, BeautifulSoup
html_doc= """
<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p> Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
questions = soup.select('h4')
lst_questions = []
for tag in questions:
    lst = []
    # Walk the siblings after this <h4> until the next <h4>
    for x in tag.next_siblings:
        if x.name == 'h4':
            break
        else:
            print(f'{str(x)}-{type(x)}')  # debug: show each sibling and its type
            if isinstance(x, Tag):
                lst.append(x.string)
    dic = {'q': tag.string, 'a': lst}
    lst_questions.append(dic)
print(lst_questions)
Upvotes: 1