user8101320

Reputation: 223

Obtain hierarchical structure from python string

I am trying to obtain a hierarchical structure of sections, sub-sections, sub-sub-sections in a Wikipedia page.

I have a string like this:

mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='

In this case the page name is 'a' and the structure is as follows:

= b =
= c =
  == d ==
  == e ==
     === f ===
     === g ===
         ==== h ====
     === i ===
  == j ==
  == k ==
= l =

The number of equals signs indicates the nesting level: one for a section, two for a sub-section, and so on. I need a Python list containing the full path of every entry in the hierarchy, like this:

mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

So far I have been able to find the sections, sub-sections and so on by doing this:

import re

sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...

But I don't know how to proceed from here to get the desired mylist.

Upvotes: 1

Views: 1014

Answers (1)

Thierry Lathuille

Reputation: 24282

You can do it in two steps:
- the first function parses your string and yields (level, name) tokens such as (0, 'a'), (1, 'b')
- the second one builds the list of paths from those tokens.

import re

def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ') 
    root, rest = string[:root_end], string[root_end:]
    yield 0, root

    # We use a regex for the remaining tokens, which consist of the following groups:
    # - one or more "=" followed by an optional space,
    # - the name, not containing any "=",
    # - and again the first group (matched via the \1 backreference)

    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):    
    out = []
    # We keep track of the current position in the hierarchy,
    # as a list of names from the root down to the current section:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy below the level of our token
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 
          'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']

print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
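
To see why the slice hierarchy[:level] is enough to track the current position, here is a minimal sketch that records which ancestors survive the cut at each step. The function name paths_with_trace and the hand-written sample tokens are only for illustration; they are not part of the code above.

```python
def paths_with_trace(tokens):
    """Return (path, kept_ancestors) pairs for a sequence of (level, name) tokens."""
    stack = []
    trace = []
    for level, name in tokens:
        # Slicing keeps only the entries shallower than this token's level ...
        kept = stack[:level]
        # ... so the new name is attached to its nearest shallower ancestor.
        stack = kept + [name]
        trace.append(('/'.join(stack), kept))
    return trace


# Hypothetical token stream, a shortened version of the one in the question
sample = [(0, 'a'), (1, 'c'), (2, 'e'), (3, 'g'), (4, 'h'), (3, 'i'), (1, 'l')]
trace = paths_with_trace(sample)
for path, kept in trace:
    print(f'{path:<12} kept ancestors: {kept}')
```

When (3, 'i') arrives after (4, 'h'), the slice drops both 'g' and 'h' and keeps ['a', 'c', 'e'], which is why 'i' ends up under 'e'. The same mechanism brings (1, 'l') straight back under the root.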

Upvotes: 1
