Reputation: 34784

How to parse hierarchy based on indents with python

I have an accounting tree that's stored with indents/spaces in the source:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

There are a fixed number of levels, so I'd like to flatten the hierarchy by using 3 fields (actual data has 6 levels, simplified for example):

L1       L2            L3
Income
Income   Revenue
Income   Revenue       IAP
Income   Revenue       Ads
Income   Other-Income
Expenses Developers    In-house
 ... etc

I can do this by checking the number of spaces prior to the account name:

for rownum in range(6,ws.max_row+1):
   accountName = str(ws.cell(row=rownum,column=1).value)
   indent = len(accountName) - len(accountName.lstrip(' '))
   if indent == 0:
      l1 = accountName
      l2 = ''
      l3 = ''
   elif indent == 3:
      l2 = accountName
      l3 = ''
   else:
      l3 = accountName

   w.writerow([l1,l2,l3])

Is there a more flexible way to achieve this based on the indentation of the current row compared to the previous row rather than assuming it's always 3 spaces per level? L1 will always have no indent, and we can trust that lower levels will be indented further than their parent, but maybe not always 3 spaces per level.

Update, ended up with this as the meat of the logic, since I ultimately wanted the account list with the content, it seemed easiest to just use the indent to decide whether to reset, append, or pop the list:

        if indent == 0:
            accountList = []
            accountList.append((indent,accountName))
        elif indent > prev_indent:
            accountList.append((indent,accountName))
        elif indent <= prev_indent:
            max_indent = int(max(accountList,key=itemgetter(0))[0])
            while max_indent >= indent:
                accountList.pop()
                max_indent = int(max(accountList,key=itemgetter(0))[0])
            accountList.append((indent,accountName))

So at each row of output the accountList is complete.

Upvotes: 8

Answers (3)

jaksco

Reputation: 531

Laurent's answer was very helpful but it was doing confusing things when I tried to create an array instead of printing directly to stdout. Here is how I was able to make it work:

from copy import deepcopy

def parse(f):
    d = []
    stack = []
    for content in f.splitlines():
        row = content.split("    ")
        stack[:] = stack[: len(row) - 1] + [row[-1]]
        s = deepcopy(stack)
        d.append(s)

    return d

Upvotes: 0

Right leg

Reputation: 16730

You can mimick the way Python actually parses the indentation. First, create a stack that will contain the indentation levels. At each line:

If the indentation is bigger than the top of the stack, push it and increase the depth level.
If it is the same, continue at the same level.
If it is lower, pop the top of the stack while it is higher than the new indentation. If you find a lower indentation level before finding exactly the same, then there is an indentation error.

indentation = []
indentation.append(0)
depth = 0

f = open("test.txt", 'r')

for line in f:
    line = line[:-1]

    content = line.strip()
    indent = len(line) - len(content)
    if indent > indentation[-1]:
        depth += 1
        indentation.append(indent)

    elif indent < indentation[-1]:
        while indent < indentation[-1]:
            depth -= 1
            indentation.pop()

        if indent != indentation[-1]:
            raise RuntimeError("Bad formatting")

    print(f"{content} (depth: {depth})")

With a "test.txt" file whose content is as you provided:

Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses

Here is the output:

Income (depth: 0)
Revenue (depth: 1)
IAP (depth: 2)
Ads (depth: 2)
Other-Income (depth: 1)
Expenses (depth: 0)
Developers (depth: 1)
In-house (depth: 2)
Contractors (depth: 2)
Advertising (depth: 1)
Other Expense (depth: 1)

So, what can you do with this? Suppose you want to build nested lists. First, create a data stack.

When you find an indentation, append a new list at the end of the data stack.
When you find an unindentation, pop the top list, and append it to the new top.

And regardless, for each line, append the content to the list at the top of the data stack.

Here is the corresponding implementation:

for line in f:
    line = line[:-1]

    content = line.strip()
    indent = len(line) - len(content)
    if indent > indentation[-1]:
        depth += 1
        indentation.append(indent)
        data.append([])

    elif indent < indentation[-1]:
        while indent < indentation[-1]:
            depth -= 1
            indentation.pop()
            top = data.pop()
            data[-1].append(top)

        if indent != indentation[-1]:
            raise RuntimeError("Bad formatting")

    data[-1].append(content)

while len(data) > 1:
    top = data.pop()
    data[-1].append(top)

Your nested list is at the top of your data stack. The output for the same file is:

['Income',
    ['Revenue',
        ['IAP',
         'Ads'
        ],
     'Other-Income'
    ],
 'Expenses',
    ['Developers',
        ['In-house',
         'Contractors'
        ],
     'Advertising',
     'Other Expense'
    ]
 ]

This is rather easy to manipulate, although quite deeply nested. You can access the data by chaining the item accesses:

>>> l = data[0]
>>> l
['Income', ['Revenue', ['IAP', 'Ads'], 'Other-Income'], 'Expenses', ['Developers', ['In-house', 'Contractors'], 'Advertising', 'Other Expense']]
>>> l[1]
['Revenue', ['IAP', 'Ads'], 'Other-Income']
>>> l[1][1]
['IAP', 'Ads']
>>> l[1][1][0]
'IAP'

Upvotes: 11

Laurent LAPORTE

Reputation: 22992

If the indentation is a fixed amount of spaces (3 spaces here), you can simplify the calculation of the indentation level.

note: I use a StringIO to simulate a file

import io
import itertools

content = u"""\
Income
   Revenue
      IAP
      Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
   Advertising
   Other Expenses
"""

stack = []
for line in io.StringIO(content):
    content = line.rstrip()  # drop \n
    row = content.split("   ")
    stack[:] = stack[:len(row) - 1] + [row[-1]]
    print("\t".join(stack))

You get:

Income
Income  Revenue
Income  Revenue IAP
Income  Revenue Ads
Income  Other-Income
Expenses
Expenses    Developers
Expenses    Developers  In-house
Expenses    Developers  Contractors
Expenses    Advertising
Expenses    Other Expenses

EDIT: indentation not fixed

If the indentation isn't fixed (you don't always have 3 spaces) like in the example below:

content = u"""\
Income
   Revenue
    IAP
    Ads
   Other-Income
Expenses
   Developers
      In-house
      Contractors
  Advertising
  Other Expenses
"""

You need to estimate the shifting at each new line:

stack = []
last_indent = u""
for line in io.StringIO(content):
    indent = "".join(itertools.takewhile(lambda c: c == " ", line))
    shift = 0 if indent == last_indent else (-1 if len(indent) < len(last_indent) else 1)
    index = len(stack) + shift
    stack[:] = stack[:index - 1] + [line.strip()]
    last_indent = indent
    print("\t".join(stack))

Upvotes: 7

How to parse hierarchy based on indents with python

Answers (3)

Related Questions