Aaron

Reputation: 317

How do I reformat a CSV with raw HTML into a cleaned CSV data set?

I have been given a data set in which HTML is embedded in the CSV cells, and I need to transform it into a clean CSV with the HTML stripped out. The expected result is shown below. One cell lists files, each identified individually, and each file needs to become its own row. A separate cell holds the keywords, also embedded in HTML. Each keyword needs to become a new column whose value is TRUE if the keyword is found in that row and FALSE if it is not. The solution also needs to preserve keyword columns created by earlier rows, including any values already set to TRUE.

I have searched for examples of similar problems, but this one seems to be either outside the technical vocabulary I know (I am not a data-cleaning professional) or the requirements are unusual.

This is a typical row within a CSV...

    "<div id="categories">
    <h3>Categories</h3>
    <ul>
    <li><a href="">Keyword1</a></li>
    <li><a href="">Keyword2</a></li>
    </ul>
    </div>
    ","<div id="file"><h3>File</h3>, <div id="image">
    <a href="A">A.jpg</a>
    <br/></div>
    ,  <div id="image">
    <a href="B">B.jpg</a>
    <br/></div>
    </div>
    "

The number of Keywords and Files in each row varies.

Expected result

    File, Keyword1, Keyword2, Keyword3
    A.jpg, TRUE, TRUE, FALSE
    B.jpg, TRUE, TRUE, FALSE
    C.jpg, TRUE, FALSE, TRUE
    D.jpg, FALSE, FALSE, TRUE
    E.jpg, FALSE, FALSE, TRUE

Upvotes: 2

Views: 241

Answers (1)

Chiheb Nexus

Reputation: 9267

Here is one way to get your desired output for a single row using BeautifulSoup:

    from bs4 import BeautifulSoup as bs


    # Sample input: the two HTML cells of one CSV row
    a = '''
        <div id="categories">
            <h3>Categories</h3>
            <ul>
                <li><a href="">Keyword1</a></li>
                <li><a href="">Keyword2</a></li>
            </ul>
        </div>
        ","
        <div id="file">
            <h3>File</h3>,
            <div id="image">
                <a href="A">A.jpg</a>
                <br/>
            </div>
            ,
            <div id="image">
                <a href="B">B.jpg</a>
                <br/>
            </div>
        </div>
    '''


    def find_elms(soup, tag, attribute):
        """Find the block by its tag and attributes; return the text of its links."""
        block = soup.find(tag, attribute)
        if block:
            return [elm.text for elm in block.find_all('a')]
        return []


    def pretty_print(master, categories, files):
        """Print a header line, then one True/False line per file."""
        cat = '\t'.join(['{elm:<12}'.format(elm=elm) for elm in master])
        print(cat)
        for k in files:
            out = '{file_:<12}'.format(file_=k)
            # A cell is True when the column's keyword was found in this row
            cells = '\t'.join(
                ['{:<12}'.format(str(j in categories)) for j in master[1:]]
            )
            print(out, cells)


    # Full list of output columns; Keyword3 does not occur in the sample row
    master_categories = ['File', 'Keyword1', 'Keyword2', 'Keyword3']
    soup = bs(a, 'html.parser')
    categories = find_elms(soup, 'div', {'id': 'categories'})
    files = find_elms(soup, 'div', {'id': 'file'})
    pretty_print(master_categories, categories, files)

Output:

    File            Keyword1        Keyword2        Keyword3
    A.jpg        True           True            False
    B.jpg        True           True            False
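
The snippet above handles one row and only prints the result. To process the whole file and write a real CSV, something along these lines might work. This is a rough sketch and not tested against your actual data. It assumes the first column holds the categories HTML and the second holds the files HTML, and that the input is valid CSV (quotes inside a cell are doubled). To keep it dependency-free it swaps BeautifulSoup for the standard library's `html.parser`. The names `LinkText`, `link_texts`, `build_table`, `write_table` and `links`, and the three sample rows, are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect the text of every <a> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_a = True
            self.links.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.links[-1] += data


def link_texts(html):
    """Return the stripped, non-empty link texts found in one cell."""
    parser = LinkText()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.links if text.strip()]


def build_table(reader):
    """Return (keywords, {file: set of keywords}) over all input rows."""
    keywords = []  # column order = order of first appearance
    table = {}     # dicts keep insertion order, so files stay in input order
    for row in reader:
        if len(row) < 2:
            continue  # assumption: rows without both cells are skipped
        row_keywords = link_texts(row[0])
        for kw in row_keywords:
            if kw not in keywords:
                keywords.append(kw)
        for name in link_texts(row[1]):
            # Union, so a TRUE recorded by an earlier row is never lost
            table.setdefault(name, set()).update(row_keywords)
    return keywords, table


def write_table(keywords, table, out):
    """Write one line per file with a TRUE/FALSE cell per keyword."""
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(['File'] + keywords)
    for name, found in table.items():
        flags = ['TRUE' if kw in found else 'FALSE' for kw in keywords]
        writer.writerow([name] + flags)


def links(*names):
    """Hypothetical helper: build a sample HTML cell with one link per name."""
    return '<div>' + ''.join(
        '<li><a href="">{}</a><br/></li>'.format(n) for n in names
    ) + '</div>'


# Hypothetical input standing in for the real file
raw = io.StringIO()
csv.writer(raw).writerows([
    [links('Keyword1', 'Keyword2'), links('A.jpg', 'B.jpg')],
    [links('Keyword1', 'Keyword3'), links('C.jpg')],
    [links('Keyword3'), links('D.jpg', 'E.jpg')],
])
raw.seek(0)

keywords, table = build_table(csv.reader(raw))
out = io.StringIO()
write_table(keywords, table, out)
print(out.getvalue())
```

For a real file, replace the two `io.StringIO` objects with `open('input.csv', newline='')` and `open('output.csv', 'w', newline='')`. Because the keyword columns are only known after every row has been read, the table is built in memory first and written at the end.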

Upvotes: 2
