How to store the HTML within an opening and closing tag with Python

Question

I am reading in an HTML document and want to store the HTML nested within a div tag of a certain name, while maintaining its structure (the spacing). The goal is to convert an HTML document into components for React. I am struggling with how to store the structure of the nested HTML and how to locate the correct closing tag for the div that marks everything nested within it as a React component (div class='rc-componentname' is the opening tag). Any help would be very appreciated. Thanks!

Edit: I assume regular expressions are the best way to go about this. I haven't used regex before, so if that is correct, it would be great if someone could point me in the right direction for the expression to use in this context.

import os

components = []

class react_template():
    def __init__(self, component_name): # add nested html as second element
        self.Import = "import React, { Component } from 'react';"
        self.Class = "class " + component_name + " extends Component {"
        self.Render = "render() {"
        self.Return = "return "
        self.Export = "export default " + component_name + ";"

def react(component):
    r = react_template(component)

    if not os.path.exists('components'): # create components folder
        os.mkdir('components')
    os.chdir('components')

    if not os.path.exists(component): # create folder for component
        os.mkdir(component)
    os.chdir(component)

    with open(component + '.js', 'wb') as f: # create js component file
        for j_key, j_code in r.__dict__.items():
            f.write(j_code.encode('utf-8') + '\n'.encode('utf-8'))


def process_html():
    with open('file.html', 'r') as f:
        for line in f:
            if 'rc-' in line:
                char_soup = list(line)
                for index, char in enumerate(char_soup):
                    if char == 'r' and char_soup[index+1] == 'c' and char_soup[index+2] == '-':
                        sliced_soup = char_soup[int(index+3):]
                        c_slice_index = sliced_soup.index("\'")
                        component = "".join(sliced_soup[:c_slice_index])
                        components.append(component)
                        innerHTML(sliced_soup)
                        # react(component)

def innerHTML(sliced_soup): # work in progress
    first_closing = sliced_soup.index(">")
    sliced_soup = "".join(sliced_soup[first_closing:]).split(" ")


def generate_components(components):
    for c in components:
        react(c)


if __name__ == "__main__":
    process_html()

burling · Accepted Answer

I see you've used the word soup in your code... maybe you've already tried and disliked BeautifulSoup? If you haven't tried it, I'd recommend you look at BeautifulSoup instead of attempting to parse HTML with regex. Although regex can be sufficient for a single tag or even a handful of tags, markup languages are deceptively simple: once tags of the same name are nested, a pattern has a hard time finding the closing tag that actually matches. BeautifulSoup is a fine library and makes dealing with markup much easier.
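To make the nesting problem concrete, here is a small sketch using only the standard library. The sample HTML and the component name `rc-card` are made up for illustration; the pattern is just one plausible first attempt, not something from your code.

```python
import re

# Hypothetical sample: an rc- component that contains another div.
html = """<div class='rc-card'>
  <div class='title'>Hello</div>
  <p>Body text</p>
</div>"""

# A non-greedy match stops at the FIRST </div>, which closes the inner
# div, not the rc-card div we actually care about.
pattern = re.compile(r"<div class='rc-card'>(.*?)</div>", re.DOTALL)
captured = pattern.search(html).group(1)

print(repr(captured))
```

The captured text ends right after "Hello", so the paragraph is lost. A greedy match has the opposite problem: it runs on to the last closing div tag in the whole document.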

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

BeautifulSoup lets you treat the entire HTML document as a single object, so you can:

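from bs4 import BeautifulSoup
with open('file.html') as f:  # parse the whole document into one object
    soup = BeautifulSoup(f, 'html.parser')

# create a list of specific elements as objects
soup.find_all('div')

# find a specific element by id
soup.find(id="custom-header")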
