gm-123

Reputation: 238

How to Convert HTML into a List of Python Dictionaries

I need to convert HTML into a list of Python dictionaries. Sample HTML:

<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>

Expected Result:

[
    {
        "q": "Question 1",
        "a": ["Some text here too", "And some text here"]
    },
    {
        "q": "Question 2",
        "a": ["Answer Here", "<ul><li>text1</li><li>text2</li>"]
    },
    {
        "q": "Question 3",
        "a": ["Answer Here", "<table>...</table>"]
    },
    {
        "q": "Question 4",
        "a": ["Answer"]
    }
]

Any assistance will be highly appreciated. Thanks in advance :)

Upvotes: 0

Views: 2762

Answers (2)

pippo1980

Reputation: 3096

Copied from @Sayeed Hossain's exceptionally clear explanatory answer, which uses Python's built-in html.parser module (see the docs for the Python HTML parser module):

from html.parser import HTMLParser

html_chunk =''' <div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>'''


class html_to_dict_parser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)

    self.in_word_label = False   # True while inside an <h4> (the question)
    self.in_definition = False   # True while inside a <p> (an answer)
    self.has_definition = False  # True while a question still awaits its first answer

  def handle_starttag(self, tag, attrs):
    if tag == 'h4':
      self.in_word_label = True
    elif tag == 'p':
      self.in_definition = True

  def handle_endtag(self, tag):
    if tag == 'h4':
      self.in_word_label = False
    elif tag == 'p':
      self.in_definition = False

  def handle_data(self, data):
    if self.in_word_label:
      # Remember the question text (note that it is lowercased)
      self.latest_word = data.lower()
      self.has_definition = True
    elif self.in_definition and self.has_definition:
      # Pair the question with the first <p> that follows it;
      # later paragraphs and non-<p> elements are ignored
      dictionary = dict()
      dictionary["q"] = self.latest_word
      dictionary["a"] = data
      list_of_dictionary.append(dictionary)
      self.has_definition = False

# create empty list 
list_of_dictionary = []

# Run the parser!
parser = html_to_dict_parser()

parser.feed(html_chunk)

parser.close()
print(list_of_dictionary)

output:

[{'q': 'question 1', 'a': 'Some text here too'}, {'q': 'question 2', 'a': 'Answer Here'}, {'q': 'question 3', 'a': 'Answer Here'}, {'q': 'question 4', 'a': 'Answer'}]
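
Note that the parser above lowercases each question and keeps only the first <p> after it, so the output does not quite match the expected result in the question. Below is a minimal sketch of one way to extend the same html.parser approach so that every paragraph is collected and any other element (the list, the table) is kept as raw markup. The class name QAParser and its attributes are my own hypothetical names, and skipping the wrapper <div> by tag name is an assumption about the input, not a general solution.

```python
from html.parser import HTMLParser

html_chunk = '''<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p>Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>'''


class QAParser(HTMLParser):  # hypothetical name, not from the original answer
    """Treat each <h4> as a question and what follows it as its answers."""

    def __init__(self):
        super().__init__()
        self.items = []   # finished {"q": ..., "a": [...]} dictionaries
        self.mode = None  # "q" inside <h4>, "p" inside <p>, otherwise None
        self.text = []    # text pieces of the current <h4> or <p>
        self.raw = []     # markup pieces of a pending non-<p> answer

    def _flush_raw(self):
        # Store pending non-<p> markup (a list, a table, ...) as one answer.
        markup = "".join(self.raw).strip()
        if markup and self.items:
            self.items[-1]["a"].append(markup)
        self.raw = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h4", "p"):
            self._flush_raw()
            self.mode = "q" if tag == "h4" else "p"
            self.text = []
        elif self.mode is None and self.items and tag != "div":
            # Assumption: only the wrapper <div> should be skipped.
            self.raw.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h4" and self.mode == "q":
            self.items.append({"q": "".join(self.text).strip(), "a": []})
            self.mode = None
        elif tag == "p" and self.mode == "p":
            if self.items:
                self.items[-1]["a"].append("".join(self.text).strip())
            self.mode = None
        elif self.mode is None and self.raw and tag != "div":
            self.raw.append(f"</{tag}>")

    def handle_data(self, data):
        if self.mode:
            self.text.append(data)
        elif self.raw:
            self.raw.append(data)

    def close(self):
        super().close()
        self._flush_raw()  # keep markup that trails the last question


parser = QAParser()
parser.feed(html_chunk)
parser.close()
result = parser.items
print(result)
```

Because the markup is rebuilt from the parser events rather than re-serialized from a tree, the unclosed <ul> in the sample stays unclosed in the result, as in the expected output.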

Upvotes: 1

norie

Reputation: 9857

Give this a try. It's not perfect: there's a problem with 'navigable strings' that I couldn't resolve satisfactorily.

from bs4 import Tag, NavigableString, BeautifulSoup


html_doc= """
<div data-axite-container="1" data-axite-uuid="1f0c5634-9ff9-4942-861e-2c7e75d6f2ef" data-axite-id="a4d1a127-0fe8-4281-abd6-fe0653a8b519">
<h4>Question 1</h4>
<p>Some text here too</p> <p>And some text here</p>
<h4>Question 2</h4>
<p> Answer Here</p>
<ul><li>text1</li><li>text2</li>
<h4>Question 3</h4>
<p>Answer Here</p><table>...</table>
<h4>Question 4</h4>
<p>Answer</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

questions = soup.select('h4')


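lst_questions = []
for tag in questions:
  lst = []
  # Walk the nodes that follow this <h4> until the next <h4>
  for x in tag.next_siblings:
    if x.name == 'h4':
      break
    else:
      # Debug output: shows which siblings are Tag and which are NavigableString
      print(f'{str(x)}-{type(x)}')
      # Skip NavigableString nodes (the whitespace between tags)
      if isinstance(x, Tag):
        # x.string is None when the tag has more than one child, e.g. the <ul>
        lst.append(x.string)
  dic = {'q': tag.string, 'a': lst}
  lst_questions.append(dic)

print(lst_questions)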

Upvotes: 1
