Reputation: 223
I am trying to obtain a hierarchical structure of sections, sub-sections, sub-sub-sections in a Wikipedia page.
I have a string like this:
mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
In this case the page name is 'a' and the structure is as follows:
= b =
= c =
== d ==
== e ==
=== f ===
=== g ===
==== h ====
=== i ===
== j ==
== k ==
= l =
The equals signs mark a section, sub-section and so on: the more signs, the deeper the level. I need to obtain a Python list containing the full hierarchical path of every section, like this:
mylist = ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g',
'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
So far I have been able to find the sections, sub-sections and so on by doing this:
sections = re.findall(r' = (.*?)\ =', mystr)
subsections = re.findall(r' == (.*?)\ ==', mystr)
...
But I don't know how to proceed from here to get the desired mylist.
Upvotes: 1
Views: 1014
Reputation: 24282
You can do it like this:
- the first function parses your string, and yields tokens (level, name) like (0, 'a'), (1, 'b')
- the second one builds the tree from there.
import re


def tokens(string):
    # The root name doesn't respect the '= name =' convention,
    # so we cut the string on the first " = " and yield the root name
    root_end = string.index(' = ')
    root, rest = string[:root_end], string[root_end:]
    yield 0, root
    # We use a regex for the remaining tokens, which consist of the following groups:
    # - one or more "=" followed by an optional space,
    # - the name, not containing any "=",
    # - and again the first group of "=...", matched by the backreference \1
    tokens_re = re.compile(r'(=+ ?)([^=]+)\1')
    # findall will return a list:
    # [('= ', 'b '), ('= ', 'c '), ('== ', 'd '), ('== ', 'e '), ('=== ', 'f '), ...]
    for token in tokens_re.findall(rest):
        level = token[0].count('=')
        name = token[1].strip()
        yield level, name


def tree(token_list):
    out = []
    # We keep track of the current position in the hierarchy:
    hierarchy = []
    for token in token_list:
        level, name = token
        # We cut the hierarchy at the level of our token,
        # dropping any entry at the same depth or deeper
        hierarchy = hierarchy[:level]
        # and append the current one
        hierarchy.append(name)
        out.append('/'.join(hierarchy))
    return out


mystr = 'a = b = = c = == d == == e == === f === === g === ==== h ==== === i === == j == == k == = l ='
out = tree(tokens(mystr))
# Check that this is your expected output
assert out == ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g',
               'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
print(out)
# ['a', 'a/b', 'a/c', 'a/c/d', 'a/c/e', 'a/c/e/f', 'a/c/e/g', 'a/c/e/g/h', 'a/c/e/i', 'a/c/j', 'a/c/k', 'a/l']
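To see why truncating the hierarchy at the token's level is enough, here is a minimal sketch that runs the same stack logic on a short, hand-written list of (level, name) pairs and prints the path after each step. The helper name trace_paths and the sample pairs are only illustrative, not part of the code above, and the sketch assumes each level is at most one deeper than the previous one.

```python
def trace_paths(pairs):
    """Yield (level, name, path) for each (level, name) pair."""
    stack = []  # names of the ancestors of the current section
    for level, name in pairs:
        # Drop every entry at this depth or deeper, then push the new name.
        del stack[level:]
        stack.append(name)
        yield level, name, '/'.join(stack)


# Hypothetical sample input: level 0 is the page name.
pairs = [(0, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (2, 'e'), (3, 'f'), (1, 'l')]

for level, name, path in trace_paths(pairs):
    print(level, name, path)
# 0 a a
# 1 b a/b
# 1 c a/c
# 2 d a/c/d
# 2 e a/c/e
# 3 f a/c/e/f
# 1 l a/l
```

Going from 'f' (level 3) straight back to 'l' (level 1) discards 'c', 'e' and 'f' in one step, which is why no explicit "go up" logic is needed.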
Upvotes: 1