Reputation: 581
I have a df(small portion below). I am trying to add an additional tw0 columns rolled_doc_cnt
& rolled_doc_cnt_all
that contains the sum of doc_cnt
/doc_cnt_all
for the current row and all of its children. In the very limited rows below, df.at[0,'rolled_doc_cnt'] = 1317
& df.at[0,'rolled_doc_cnt_all'] = 3540
SYMBOL level not-allocatable additional-only doc_cnt doc_cnt_all parent
0 A 2 True False 0 0
1 A01 4 True False 0 0 A
2 A01B 5 True False 0 0 A01
3 A01B 1/00 7 False False 198 244 A01B
4 A01B 1/02 8 False False 230 538 A01B 1/00
5 A01B 1/022 9 False False 83 238 A01B 1/02
6 A01B 1/024 9 False False 28 63 A01B 1/02
7 A01B 1/026 9 False False 100 120 A01B 1/02
8 A01B 1/028 9 False False 27 82 A01B 1/02
9 A01B 1/04 9 False False 29 54 A01B 1/02
10 A01B 1/06 8 False False 78 508 A01B 1/00
11 A01B 1/065 9 False False 118 150 A01B 1/06
12 A01B 1/08 9 False False 71 326 A01B 1/06
13 A01B 1/10 9 False False 14 30 A01B 1/06
14 A01B 1/12 9 False False 24 86 A01B 1/06
15 A01B 1/14 9 False False 44 131 A01B 1/06
16 A01B 1/16 8 False False 159 518 A01B 1/00
17 A01B 1/165 9 False False 50 114 A01B 1/16
18 A01B 1/18 9 False False 64 338 A01B 1/16
I got some help creating the parent
column here.
def GetParent():
# level 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19, 20
hierarchy = [0, 0, 0, 0, 2, 4, 0, 5, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
parent = ['']*len(hierarchy)
def func(row):
# print(row)
symbol, level = row[['SYMBOL', 'level']]
parent_level = hierarchy[level]
parent_symbol = parent[parent_level]
parent[level] = symbol
return pd.Series([parent_symbol], index=['parent'])
return func
# create a column with the parents
st = time()
parents = dfa.apply(GetParent(), axis=1)
dfa = pd.concat([dfa, parents], axis=1)
print((time()-st)/60, 'minutes elapsed')
I tried to debug this code in spyder so I could see how the list parent
changes as I advance rows of the df, but I cannot figure out how to jump into the function GetParent()
without also jumping into the pandas function apply()
. Jumping into apply()
eventually leads me to a recursion error.
I tried to make a few modification to GetParents()
to also keep track of the document counts of each symbol at each level, but then I realized that I was keeping track of the document counts for the parent nodes, but not the children. So, with the df above, how would I be able to create something like the following df?
SYMBOL level not-allocatable additional-only doc_cnt doc_cnt_all parent rolled_doc_cnt rolled_doc_cnt_all
0 A 2 TRUE FALSE 0 0 1317 3540
1 A01 4 TRUE FALSE 0 0 A 1317 3540
2 A01B 5 TRUE FALSE 0 0 A01 1317 3540
3 A01B 1/00 7 FALSE FALSE 198 244 A01B 1317 3540
4 A01B 1/02 8 FALSE FALSE 230 538 A01B 1/00 497 1095
5 A01B 1/022 9 FALSE FALSE 83 238 A01B 1/02 83 238
6 A01B 1/024 9 FALSE FALSE 28 63 A01B 1/02 28 63
7 A01B 1/026 9 FALSE FALSE 100 120 A01B 1/02 100 120
8 A01B 1/028 9 FALSE FALSE 27 82 A01B 1/02 27 82
9 A01B 1/04 9 FALSE FALSE 29 54 A01B 1/02 29 54
10 A01B 1/06 8 FALSE FALSE 78 508 A01B 1/00 349 1231
11 A01B 1/065 9 FALSE FALSE 118 150 A01B 1/06 118 150
12 A01B 1/08 9 FALSE FALSE 71 326 A01B 1/06 71 326
13 A01B 1/10 9 FALSE FALSE 14 30 A01B 1/06 14 30
14 A01B 1/12 9 FALSE FALSE 24 86 A01B 1/06 24 86
15 A01B 1/14 9 FALSE FALSE 44 131 A01B 1/06 44 131
16 A01B 1/16 8 FALSE FALSE 159 518 A01B 1/00 273 970
17 A01B 1/165 9 FALSE FALSE 50 114 A01B 1/16 50 114
18 A01B 1/18 9 FALSE FALSE 64 338 A01B 1/16 64 338
Also feel free to tell me that the way I am trying to do this is not optimal and suggest another way
Upvotes: 0
Views: 279
Reputation: 30991
The function to get parent symbol (GetParent) can be simpified a bit:
def GetParent():
hierarchy = [0, 0, 0, 0, 2, 4, 0, 5, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19]
parent = ['']*len(hierarchy)
def func(row):
symbol, level = row[['SYMBOL', 'level']]
parent_level = hierarchy[level]
parent_symbol = parent[parent_level]
parent[level] = symbol
return parent_symbol
return func
i.e. it returns just the parent_symbol (not Series) and then it can be applied with direct creation of the target column:
dfa['parent'] = dfa.apply(GetParent(), axis=1)
I tried this function on your sample data and got no error.
Note that GetParent is able to process the hierarchy up to 20 levels, so maybe the source of your error is that your full data has more hierarchy levels?
You didn't write anything where this error occured and what was the error message. Start looking for the error cause from this point.
Another hint: Start from running this code on some initial part of your data (e.g. the first half). The goal is to find the source row, on which the error occurs. Then look very thoroughly at this row (and a bit before it). This is likely to give you some clue about the reason of error.
I came up with a solution that should work regardless of the hierarchy depth.
def GetParent():
par = {0: ''}
def func(row):
symbol, level = row[['SYMBOL', 'level']]
parLevel = level - 2 if level in [2, 4, 7] else level - 1
parSym = par[parLevel]
par[level] = symbol
return parSym
return func
Note than the issus of "missing hierarchy levels" is addressed in:
parLevel = level - 2 if level in [2, 4, 7] else level - 1
which returns level - 2 for levels "just after" the missing levels and level - 1 for all other levels.
The parent symbol is kept in an "internal" dictionary, so there is no problem with list index out of bounds.
The apply this function just as above, directly creating the target column.
I see that you managed to compute ancestors, so now how to compute sums of the columns of interest.
As far as operations on tree structures in Pandas are concerned, I found an interesting piece of code in StackOverflow on page:
Hierarchical data: efficiently build a list of every descendant for each node
Copy from there list_ancestors function and 2 other functions used by it, namely trace_nodes and numpy_col_inner_many_to_one_join.
I will not repeat this code here.
Then run:
links = list_ancestors(dfa[['SYMBOL', 'parent']].values)
list_ancestors function generates a DataFrame, with descendant and ancestor columns, including "links" between an ancestor and descendant at any depth (till now you had only direct descendants).
To add SYMBOL, doc_cnt and doc_cnt_all columns, run:
links2 = pd.merge(links, dfa[['SYMBOL', 'doc_cnt', 'doc_cnt_all']],
left_on='descendant', right_on='SYMBOL', copy=False)
And now start summing:
s1 = links2.groupby('ancestor')['doc_cnt', 'doc_cnt_all'].apply(sum)
s2 = dfa[['SYMBOL', 'doc_cnt', 'doc_cnt_all']].set_index('SYMBOL')
s3 = pd.concat([s1, s2]).groupby(level=0).apply(sum)
A bit of explanation:
The only 2 steps remaining to do are:
The code to do it is:
s3.columns=['rolled_doc_cnt', 'rolled_doc_cnt_all']
dfa.join(s3, on='SYMBOL')
For your test data, I got just the expected result.
Upvotes: 1