Reputation: 135
I am trying to convert Word documents (.docx) to a dataframe. These .docx files are first converted to HTML using the following:
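import os
import re

import mammoth

# Fill in the path to the folder that holds the .docx files
os.chdir('C:jan_2021')  # os.chdir() returns None, so its result is not stored
filename = "newsupdatedocx"
regex = '\xc2\xb7'  # UTF-8 bytes of the middle dot; not used below

with open(filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    text = result.value  # The generated HTML

# Replace pipe and bullet characters with spaces
text2 = re.sub('[|•●]', " ", text)

with open('output.txt', 'w', encoding='utf-8') as text_file:
    text_file.write(text2)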
This provides the following HTML output:
print(prettify)
<html>
<body>
<p>
Newsupdate of date 01-01-2021
</p>
<h1>
Header - worldwide news - category nr 1.
</h1>
<h2>
Header - title of article nr. 1
</h2>
<p>
Source: economist, google, NYTimes
</p>
<ul>
<li>
First bullet point related to article 1
</li>
<li>
Second bullet point
</li>
</ul>
</body>
</html>
As you can see, converting the document to HTML gives it a structure that can be analyzed. Now I would like to convert this to a dataframe. I want to put all elements into a list so that I can iterate over it and check whether each item is an <h1> or an <h2> element. Ultimately, I want a dataframe with the columns: date, type of news, article title, source (the paragraph before the list items), and items (the bullet points).
The first step is to convert all elements to a list, so is there a certain function that converts this HTML to a list?
Upvotes: 0
Views: 150
Reputation: 135
I found the answer, using the following code:
tags = soup.find_all(['p', 'h1', 'h2', 'li'])
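Once tags exists, the name and text of each tag can be folded into one row per bullet point. Below is a minimal sketch, not a definitive solution. It assumes the pairs were produced with something like [(t.name, t.get_text(strip=True)) for t in tags], that the first <p> containing a dd-mm-yyyy date is the newsletter date, and that the source paragraph always starts with "Source:". The function name rows_from_tags and the column names are made up for illustration, and the pairs are hard-coded from the sample HTML in the question so the snippet runs on its own.

```python
import re


def rows_from_tags(pairs):
    """Fold (tag_name, text) pairs into one dict per bullet point."""
    rows = []
    date = category = title = source = None
    for name, text in pairs:
        if name == "h1":
            # A new news category resets the article-level fields
            category = text
            title = source = None
        elif name == "h2":
            # A new article resets the source
            title = text
            source = None
        elif name == "p":
            if text.startswith("Source:"):
                source = text[len("Source:"):].strip()
            elif date is None:
                # Assumption: the date is written as dd-mm-yyyy
                match = re.search(r"\d{2}-\d{2}-\d{4}", text)
                if match:
                    date = match.group()
        elif name == "li":
            rows.append({
                "date": date,
                "type_of_news": category,
                "article_title": title,
                "source": source,
                "item": text,
            })
    return rows


# Pairs as they would come out of the sample HTML in the question
pairs = [
    ("p", "Newsupdate of date 01-01-2021"),
    ("h1", "Header - worldwide news - category nr 1."),
    ("h2", "Header - title of article nr. 1"),
    ("p", "Source: economist, google, NYTimes"),
    ("li", "First bullet point related to article 1"),
    ("li", "Second bullet point"),
]
rows = rows_from_tags(pairs)
```

Because rows is a list of dictionaries with identical keys, passing it to pd.DataFrame(rows) should give the dataframe with one line per bullet point.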
Upvotes: 1