Tobias

Reputation: 135

Making a pandas dataframe out of HTML

I am trying to convert Word documents (.docx) to a dataframe. The .docx files are first converted to HTML with mammoth, using the following code:

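import os
import re

import mammoth

# Switch to the folder that holds the .docx files (fill in your own path).
# os.chdir() returns None, so its result is not assigned to a variable.
os.chdir('C:jan_2021')
filename = "newsupdatedocx"
regex = '\xc2\xb7'  # currently unused

with open(filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)

text = result.value  # the generated HTML string
# Replace pipe and bullet characters with a space
text2 = re.sub(r'[|•●]', ' ', text)

with open('output.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(text2)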

This provides the following HTML output:

print(prettify)
<html>
 <body>
  <p>
   Newsupdate of date 01-01-2021
  </p>
  <h1>
   Header - worldwide news - category nr 1.
  </h1>
  <h2>
   Header - title of article nr. 1
  </h2>
  <p>
   Source: economist, google, NYTimes
  </p>
  <ul>
   <li>
    First bullet point related to article 1
   </li>
   <li>
    Second bullet point
   </li>
  </ul>
 </body>
</html>

As you can see, converting the document to HTML gives it a structure that can be analyzed. Now I would like to convert this HTML to a dataframe. My plan is to put all elements into a list, iterate over that list, and check for each item whether it is an <h1> element, an <h2> element, and so on. Ultimately, I want a dataframe with the columns: date, type of news, article title, source (the line that precedes the list items) and, lastly, items (the bullet points).

The first step is to convert all elements to a list. Is there a function that converts this HTML to a list of elements?

Upvotes: 0

Views: 150

Answers (1)

Tobias

Reputation: 135

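I found the answer using BeautifulSoup's find_all, which returns the matching tags as a list in document order (soup is the BeautifulSoup object built from the HTML):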

tags = soup.find_all(['p', 'h1', 'h2', 'li'])
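
For completeness, here is a minimal sketch of how that list of tags could be turned into the dataframe described in the question. It is not a definitive implementation: the sample HTML is copied from the question, the column names are my own choice, and it assumes that the date paragraph contains a dd-mm-yyyy date and that the source paragraph starts with "Source:".

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Sample input copied from the question; in practice this would be the
# HTML string returned by mammoth (result.value).
html = """
<html><body>
<p>Newsupdate of date 01-01-2021</p>
<h1>Header - worldwide news - category nr 1.</h1>
<h2>Header - title of article nr. 1</h2>
<p>Source: economist, google, NYTimes</p>
<ul>
  <li>First bullet point related to article 1</li>
  <li>Second bullet point</li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all accepts a list of tag names and returns matches in document order.
tags = soup.find_all(["p", "h1", "h2", "li"])

rows = []
# Running context that is updated while walking through the tags.
date = category = title = source = None

for tag in tags:
    text = tag.get_text(strip=True)
    if tag.name == "h1":
        # A new news category resets the article-level context.
        category = text
        title = source = None
    elif tag.name == "h2":
        title = text
        source = None
    elif tag.name == "p":
        # Assumption: a paragraph is either a "Source:" line or the date line.
        if text.startswith("Source:"):
            source = text[len("Source:"):].strip()
        else:
            match = re.search(r"\d{2}-\d{2}-\d{4}", text)
            if match:
                date = match.group()
    elif tag.name == "li":
        # Each bullet point becomes one row carrying the current context.
        rows.append(
            {
                "date": date,
                "type_of_news": category,
                "article_title": title,
                "source": source,
                "item": text,
            }
        )

# Column names are illustrative, not prescribed by the question.
columns = ["date", "type_of_news", "article_title", "source", "item"]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Each bullet point ends up as one row, with the date, category, title and source repeated from the most recent matching element above it. If the real documents format the date or source line differently, the two checks in the "p" branch are the part to adjust.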

Upvotes: 1
