Britt
Britt

Reputation: 581

Sum values of parent nodes in a dataframe

I have a df(small portion below). I am trying to add an additional tw0 columns rolled_doc_cnt & rolled_doc_cnt_all that contains the sum of doc_cnt/doc_cnt_all for the current row and all of its children. In the very limited rows below, df.at[0,'rolled_doc_cnt'] = 1317 & df.at[0,'rolled_doc_cnt_all'] = 3540

    SYMBOL  level   not-allocatable additional-only doc_cnt doc_cnt_all parent
0   A   2   True    False   0   0   
1   A01 4   True    False   0   0   A
2   A01B    5   True    False   0   0   A01
3   A01B   1/00 7   False   False   198 244 A01B
4   A01B   1/02 8   False   False   230 538 A01B   1/00
5   A01B   1/022    9   False   False   83  238 A01B   1/02
6   A01B   1/024    9   False   False   28  63  A01B   1/02
7   A01B   1/026    9   False   False   100 120 A01B   1/02
8   A01B   1/028    9   False   False   27  82  A01B   1/02
9   A01B   1/04 9   False   False   29  54  A01B   1/02
10  A01B   1/06 8   False   False   78  508 A01B   1/00
11  A01B   1/065    9   False   False   118 150 A01B   1/06
12  A01B   1/08 9   False   False   71  326 A01B   1/06
13  A01B   1/10 9   False   False   14  30  A01B   1/06
14  A01B   1/12 9   False   False   24  86  A01B   1/06
15  A01B   1/14 9   False   False   44  131 A01B   1/06
16  A01B   1/16 8   False   False   159 518 A01B   1/00
17  A01B   1/165    9   False   False   50  114 A01B   1/16
18  A01B   1/18 9   False   False   64  338 A01B   1/16

I got some help creating the parent column here.

def GetParent():
    # level      0  1  2  3  4  5  6  7  8  9  10 11  12  13  14  15  16  17  18  19, 20
    hierarchy = [0, 0, 0, 0, 2, 4, 0, 5, 7, 8, 9,
                 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    parent = ['']*len(hierarchy)

    def func(row):
        # print(row)
        symbol, level = row[['SYMBOL', 'level']]

        parent_level = hierarchy[level]
        parent_symbol = parent[parent_level]

        parent[level] = symbol

        return pd.Series([parent_symbol], index=['parent'])

    return func


# create a column with the parents
st = time()
parents = dfa.apply(GetParent(), axis=1)
dfa = pd.concat([dfa, parents], axis=1)
print((time()-st)/60, 'minutes elapsed')

I tried to debug this code in spyder so I could see how the list parent changes as I advance rows of the df, but I cannot figure out how to jump into the function GetParent() without also jumping into the pandas function apply(). Jumping into apply() eventually leads me to a recursion error.

I tried to make a few modification to GetParents() to also keep track of the document counts of each symbol at each level, but then I realized that I was keeping track of the document counts for the parent nodes, but not the children. So, with the df above, how would I be able to create something like the following df?

    SYMBOL  level   not-allocatable additional-only doc_cnt doc_cnt_all parent  rolled_doc_cnt  rolled_doc_cnt_all
0   A   2   TRUE    FALSE   0   0       1317    3540
1   A01 4   TRUE    FALSE   0   0   A   1317    3540
2   A01B    5   TRUE    FALSE   0   0   A01 1317    3540
3   A01B   1/00 7   FALSE   FALSE   198 244 A01B    1317    3540
4   A01B   1/02 8   FALSE   FALSE   230 538 A01B   1/00 497 1095
5   A01B   1/022    9   FALSE   FALSE   83  238 A01B   1/02 83  238
6   A01B   1/024    9   FALSE   FALSE   28  63  A01B   1/02 28  63
7   A01B   1/026    9   FALSE   FALSE   100 120 A01B   1/02 100 120
8   A01B   1/028    9   FALSE   FALSE   27  82  A01B   1/02 27  82
9   A01B   1/04 9   FALSE   FALSE   29  54  A01B   1/02 29  54
10  A01B   1/06 8   FALSE   FALSE   78  508 A01B   1/00 349 1231
11  A01B   1/065    9   FALSE   FALSE   118 150 A01B   1/06 118 150
12  A01B   1/08 9   FALSE   FALSE   71  326 A01B   1/06 71  326
13  A01B   1/10 9   FALSE   FALSE   14  30  A01B   1/06 14  30
14  A01B   1/12 9   FALSE   FALSE   24  86  A01B   1/06 24  86
15  A01B   1/14 9   FALSE   FALSE   44  131 A01B   1/06 44  131
16  A01B   1/16 8   FALSE   FALSE   159 518 A01B   1/00 273 970
17  A01B   1/165    9   FALSE   FALSE   50  114 A01B   1/16 50  114
18  A01B   1/18 9   FALSE   FALSE   64  338 A01B   1/16 64  338

Also feel free to tell me that the way I am trying to do this is not optimal and suggest another way

Upvotes: 0

Views: 279

Answers (1)

Valdi_Bo
Valdi_Bo

Reputation: 30991

The function to get parent symbol (GetParent) can be simpified a bit:

def GetParent():
    hierarchy = [0, 0, 0, 0, 2, 4, 0, 5, 7, 8, 9, 10,
                 11, 12, 13, 14, 15, 16, 17, 18, 19]
    parent = ['']*len(hierarchy)
    def func(row):
        symbol, level = row[['SYMBOL', 'level']]
        parent_level = hierarchy[level]
        parent_symbol = parent[parent_level]
        parent[level] = symbol
        return parent_symbol
    return func

i.e. it returns just the parent_symbol (not Series) and then it can be applied with direct creation of the target column:

dfa['parent'] = dfa.apply(GetParent(), axis=1)

I tried this function on your sample data and got no error.

Note that GetParent is able to process the hierarchy up to 20 levels, so maybe the source of your error is that your full data has more hierarchy levels?

You didn't write anything where this error occured and what was the error message. Start looking for the error cause from this point.

Another hint: Start from running this code on some initial part of your data (e.g. the first half). The goal is to find the source row, on which the error occurs. Then look very thoroughly at this row (and a bit before it). This is likely to give you some clue about the reason of error.

Edit

I came up with a solution that should work regardless of the hierarchy depth.

def GetParent():
    par = {0: ''}
    def func(row):
        symbol, level = row[['SYMBOL', 'level']]
        parLevel = level - 2 if level in [2, 4, 7] else level - 1
        parSym = par[parLevel]
        par[level] = symbol
            return parSym
    return func

Note than the issus of "missing hierarchy levels" is addressed in:

parLevel = level - 2 if level in [2, 4, 7] else level - 1

which returns level - 2 for levels "just after" the missing levels and level - 1 for all other levels.

The parent symbol is kept in an "internal" dictionary, so there is no problem with list index out of bounds.

The apply this function just as above, directly creating the target column.

Edit 2

I see that you managed to compute ancestors, so now how to compute sums of the columns of interest.

As far as operations on tree structures in Pandas are concerned, I found an interesting piece of code in StackOverflow on page:

Hierarchical data: efficiently build a list of every descendant for each node

Copy from there list_ancestors function and 2 other functions used by it, namely trace_nodes and numpy_col_inner_many_to_one_join.

I will not repeat this code here.

Then run:

links = list_ancestors(dfa[['SYMBOL', 'parent']].values)

list_ancestors function generates a DataFrame, with descendant and ancestor columns, including "links" between an ancestor and descendant at any depth (till now you had only direct descendants).

To add SYMBOL, doc_cnt and doc_cnt_all columns, run:

links2 = pd.merge(links, dfa[['SYMBOL', 'doc_cnt', 'doc_cnt_all']],
    left_on='descendant', right_on='SYMBOL', copy=False)

And now start summing:

s1 = links2.groupby('ancestor')['doc_cnt', 'doc_cnt_all'].apply(sum)
s2 = dfa[['SYMBOL', 'doc_cnt', 'doc_cnt_all']].set_index('SYMBOL')
s3 = pd.concat([s1, s2]).groupby(level=0).apply(sum)

A bit of explanation:

  • s1 contains sums (of both columns of interest) for all descendants (at any depth), but without "own" values for these columns.
  • s2 in turn contains just the "missing" own values.
  • s3 sums these values.

The only 2 steps remaining to do are:

  • change column names to the target names,
  • join dfa with s3.

The code to do it is:

s3.columns=['rolled_doc_cnt', 'rolled_doc_cnt_all']
dfa.join(s3, on='SYMBOL')

For your test data, I got just the expected result.

Upvotes: 1

Related Questions