Reputation: 317
I have been given a data set in which HTML is embedded in CSV cells, and I need to transform it into a clean CSV with the HTML stripped out. The expected result is shown below. One cell contains HTML listing individually identified files, and each file needs to be its own row. The keywords are in a separate cell, also embedded in HTML; each keyword needs to become a new column marked TRUE (the keyword is found in the row) or FALSE (the keyword is not found in the row). The solution needs to be sensitive to keywords that were previously generated and identified as TRUE.
I have searched for similar problems and examples, but either this problem is described in technical language I do not know (I am not a professional in data cleaning) or the requirements are unusual.
This is a typical row within a CSV...
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">Keyword1</a></li>
<li><a href="">Keyword2</a></li>
</ul>
</div>
","<div id="file"><h3>File</h3>, <div id="image">
<a href="A">A.jpg</a>
<br/></div>
, <div id="image">
<a href="B">B.jpg</a>
<br/></div>
</div>
"
The number of Keywords and Files in each row varies.
Expected result
File, Keyword1, Keyword2, Keyword3
A.jpg, TRUE, TRUE, FALSE
B.jpg, TRUE, TRUE, FALSE
C.jpg, TRUE, FALSE, TRUE
D.jpg, FALSE, FALSE, TRUE
E.jpg, FALSE, FALSE, TRUE
Upvotes: 2
Views: 241
Reputation: 9267
Here is a way to get your desired output using BeautifulSoup:
from bs4 import BeautifulSoup as bs
a = '''
<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">Keyword1</a></li>
<li><a href="">Keyword2</a></li>
</ul>
</div>
","
<div id="file">
<h3>File</h3>,
<div id="image">
<a href="A">A.jpg</a>
<br/>
</div>
,
<div id="image">
<a href="B">B.jpg</a>
<br/>
</div>
</div>
'''
def find_elms(soup, tag, attribute):
    """Find the block using its tag and attribute values."""
    categories_block = soup.find(tag, attribute)
    if categories_block:
        # Return the text of every link inside the matched block
        return [elm.text for elm in categories_block.find_all('a')]
    return []

def pretty_print(master, categories, files):
    """Print the header row, then one padded row per file."""
    cat = '\t'.join(['{elm:<12}'.format(elm=elm) for elm in master])
    print(cat)
    for k in files:
        out = '{file_:<12}'.format(file_=k)
        # master[0] is 'File', so the keyword columns start at index 1
        cells = '\t'.join(
            ['{:<12}'.format(str(j in categories)) for j in master[1:]]
        )
        print(out, cells)
master_categories = ['File', 'Keyword1', 'Keyword2', 'Keyword3']
soup = bs(a, 'html.parser')
categories = find_elms(soup, 'div', {'id': 'categories'})
files = find_elms(soup, 'div', {'id': 'file'})
pretty_print(master_categories, categories, files)
Output:
File Keyword1 Keyword2 Keyword3
A.jpg True True False
B.jpg True True False
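The code above handles a single row and a hard-coded keyword list. Below is a minimal sketch of one way to extend it to many rows and write real CSV output, using only the standard library. It assumes you have already run something like `find_elms` on each input row to get a `(keywords, files)` pair. The names `build_table` and `to_csv` are hypothetical. The second and third records are made up to match the expected result in the question, since only the first row's HTML was shown. I am also assuming that "sensitive to previously generated keywords" means that columns accumulate across rows and that a file seen again keeps its earlier TRUE values.

```python
import csv
import io

def build_table(records):
    """Build a header and rows from (keywords, files) pairs, one pair per input row."""
    columns = []         # keyword columns, in first-seen order
    file_keywords = {}   # file name -> set of keywords seen for that file
    for keywords, files in records:
        for kw in keywords:
            if kw not in columns:
                columns.append(kw)
        for name in files:
            # Merge with any keywords already recorded for this file
            file_keywords.setdefault(name, set()).update(keywords)
    header = ['File'] + columns
    rows = [
        [name] + ['TRUE' if kw in kws else 'FALSE' for kw in columns]
        for name, kws in file_keywords.items()
    ]
    return header, rows

def to_csv(header, rows):
    """Serialize the table to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical per-row results; only the first pair comes from the sample HTML
records = [
    (['Keyword1', 'Keyword2'], ['A.jpg', 'B.jpg']),
    (['Keyword1', 'Keyword3'], ['C.jpg']),
    (['Keyword3'], ['D.jpg', 'E.jpg']),
]
header, rows = build_table(records)
print(to_csv(header, rows), end='')
```

Because the column list is only known after every row has been seen, the rows are built in a second step; that is why FALSE can be filled in for keywords that first appear in a later row. To write to disk instead, you could pass a file opened with `newline=''` to `csv.writer`.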
Upvotes: 2