How do I reformat a CSV with raw HTML into a cleaned CSV data set?

Question

I have been given a data set in which HTML is embedded in the cells, and I need to transform it into a clean CSV with the HTML stripped out. The expected result is shown below. One cell contains HTML listing individually identified files, and each file needs to become its own row. The column names come from a separate cell: it holds individual keywords, also embedded in HTML, and each keyword needs to become a new column marked TRUE (the keyword is found in the row) or FALSE (the keyword is not found in the row). The solution needs to take into account keywords that were already generated as columns and marked TRUE for earlier rows.

I have searched for similar problems to use as examples, but either this problem lies outside the technical vocabulary I know (I am not a data-cleaning professional) or the requirements are unusual.

This is a typical row within a CSV...

    "
    Categories
    
    Keyword1
    Keyword2
    
    
    ","File
, 
    A.jpg
    

    ,  
    B.jpg
    

    
    "

The number of Keywords and Files in each row varies.

Expected result

File, Keyword1, Keyword2, Keyword3
A.jpg, TRUE, TRUE, FALSE
B.jpg, TRUE, TRUE, FALSE
C.jpg, TRUE, FALSE, TRUE
D.jpg, FALSE, FALSE, TRUE
E.jpg, FALSE, FALSE, TRUE

Chiheb Nexus · Accepted Answer

Here is a way to get your desired output using BeautifulSoup:

from bs4 import BeautifulSoup as bs


a = '''
    
        Categories
        
            Keyword1
            Keyword2
        
    
    ","
    
        File,
        
            A.jpg
            

        
        ,
        
            B.jpg
            

        
    
'''


def find_elms(soup, tag, attribute):
    """Find the block by its tag and attribute values; return the text of its links"""
    block = soup.find(tag, attribute)
    if block:
        # find_all is the current name of the deprecated findAll
        return [elm.text for elm in block.find_all('a')]
    return []


def pretty_print(master, categories, files):
    """Print a header line, then one line per file with True/False per keyword"""
    # Header: every column name, left-aligned in a 12-character field
    print('\t'.join('{elm:<12}'.format(elm=elm) for elm in master))
    for k in files:
        out = '{file_:<12}'.format(file_=k)
        # master[0] is the 'File' column, so only master[1:] are keywords
        cells = '\t'.join(
            '{:<12}'.format(str(j in categories)) for j in master[1:]
        )
        print(out, cells)


master_categories = ['File', 'Keyword1', 'Keyword2', 'Keyword3']
soup = bs(a, 'html.parser')
categories = find_elms(soup, 'div', {'id': 'categories'})
files = find_elms(soup, 'div', {'id': 'file'})
pretty_print(master_categories, categories, files)

Output:

File            Keyword1        Keyword2        Keyword3    
A.jpg        True           True            False       
B.jpg        True           True            False
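
The snippet above prints one input row against a hard-coded keyword list. As a rough sketch of the remaining step (collecting keywords across all input rows and writing the TRUE/FALSE table as CSV), something like the following might work. It assumes the keyword and file lists have already been extracted for each input row, for example with `find_elms`; the names `build_table`, `to_csv` and `records` are hypothetical and used only for illustration.

```python
import csv
import io


def build_table(records):
    """Turn (keywords, files) pairs into a header and one row per file.

    `records` is assumed to hold one (keywords, files) pair per input CSV row.
    """
    keywords_seen = []  # keyword columns, in order of first appearance
    per_file = {}       # file name -> set of keywords found for that file
    for keywords, files in records:
        for kw in keywords:
            if kw not in keywords_seen:
                keywords_seen.append(kw)
        for name in files:
            # A file seen in several input rows keeps all of its keywords
            per_file.setdefault(name, set()).update(keywords)
    header = ["File"] + keywords_seen
    rows = [
        [name] + ["TRUE" if kw in kws else "FALSE" for kw in keywords_seen]
        for name, kws in per_file.items()
    ]
    return header, rows


def to_csv(header, rows):
    """Serialize the table to CSV text using the standard csv module."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Sample data shaped like the expected result in the question
records = [
    (["Keyword1", "Keyword2"], ["A.jpg", "B.jpg"]),
    (["Keyword1", "Keyword3"], ["C.jpg"]),
    (["Keyword3"], ["D.jpg", "E.jpg"]),
]
header, rows = build_table(records)
print(to_csv(header, rows), end="")
```

Because the keyword list is built up while the rows are scanned, a keyword that first appears in a later row still gets its own column, and earlier files are marked FALSE for it.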
