Making a pandas dataframe out of HTML

Question

I am trying to convert Word documents (.docx) to a dataframe. The .docx files are first converted to HTML with the following code:

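import os
import re

import mammoth

# Fill in the folder that contains the .docx file.
# os.chdir() returns None, so its result is not assigned to a variable.
os.chdir('C:jan_2021')
filename = "newsupdatedocx"
regex = '\xc2\xb7'  # not used below

with open(filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    text = result.value  # the generated HTML, as a string
    # Replace pipe and bullet characters with a space
    text2 = re.sub('[|•●]', ' ', text)
    with open('output.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(text2)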

This provides the following HTML output:

print(prettify)

   Newsupdate of date 01-01-2021
   Header - worldwide news - category nr 1.
   Header - title of article nr. 1
   Source: economist, google, NYTimes
    First bullet point related to article 1
    Second bullet point

As you can see, converting the document to HTML gives it a structure that can be analyzed. Now I would like to convert this to a dataframe. I want to make a list of all elements so that I can iterate over it and check whether each item is a paragraph element or a list-item element belonging to a list. Ultimately, I want a dataframe with the columns: date, type of news, article title, source (the line that appears before the list items) and, lastly, items (the bullet points).

The first step is to convert all elements to a list, so is there a certain function that converts this HTML to a list?
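A minimal sketch of that first step, using only the standard library's html.parser. It assumes the paragraphs are <p> elements and the bullet points are <li> elements inside a <ul>; those tag names, the sample HTML, and the names ElementLister and html_to_elements are illustrative assumptions, not taken from the actual file. Nested lists are not handled.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Collect (tag, text) pairs for every <p> and <li>, in document order."""

    WANTED = {"p", "li"}  # assumed tag names; adjust to the real HTML

    def __init__(self):
        super().__init__()
        self.elements = []  # finished (tag, text) pairs
        self._tag = None    # wanted tag currently open, if any
        self._chunks = []   # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._tag, self._chunks = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Join the fragments and collapse runs of whitespace
            text = " ".join("".join(self._chunks).split())
            self.elements.append((tag, text))
            self._tag = None


def html_to_elements(html):
    """Return a list of (tag, text) tuples for the given HTML string."""
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return parser.elements


# Hypothetical sample that mirrors the structure shown in the question
sample_html = """
<p>Newsupdate of date 01-01-2021</p>
<p>Header - worldwide news - category nr 1.</p>
<p>Header - title of article nr. 1</p>
<p>Source: economist, google, NYTimes</p>
<ul>
  <li>First bullet point related to article 1</li>
  <li>Second bullet point</li>
</ul>
"""

elements = html_to_elements(sample_html)
for tag, text in elements:
    print(tag, "|", text)
```

Each item in the list carries its tag name, so a loop can tell paragraphs from bullet points. The list can also be passed directly to pandas, for example pd.DataFrame(elements, columns=["tag", "text"]), as a starting point for the final columns. If BeautifulSoup is already in use, soup.find_all(["p", "li"]) should give an equivalent list of elements in document order.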
