Reputation: 785
My question is simple, let's start with an hypothetical table
| a | b | c |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 0 | 1 | 2 |
| 1 | 2 | 3 |
| 1 | 2 | 3 |
| 1 | 3 | 4 |
This table should be modified, the numbers in column b don't be the same
existing in column a, and column c don't be the same number in column b.
So the number should be incremented continually.
The final should looks like this:
| a | b | c |
| 0 | 2 | 6 |
| 0 | 2 | 6 |
| 0 | 3 | 7 |
| 0 | 3 | 8 |
| 1 | 4 | 9 |
| 1 | 4 | 9 |
| 1 | 5 | 10 |
Finally, has more one table. Based in the last table, since c is child of b and b child of a, this table should be converted in two columns table with 3rd column with the height of each row:
| parent | child | size |
| 0 | 2 | 2 |
| 0 | 3 | 2 |
| 1 | 4 | 2 |
| 1 | 5 | 1 |
| 2 | 6 | 2 |
| 3 | 7 | 1 |
| 3 | 8 | 1 |
| 4 | 9 | 2 |
| 5 | 10 | 1 |
In this 3rd table, all rows are unique and map all tree using only two columns. The 3rd column is the height of each row, that's means, how many times each pair appears in second table.
Upvotes: 1
Views: 198
Reputation: 294516
My go at both
n = len(df)
col = df.columns.repeat(n)
val = df.values.ravel(order='F')
f, u = pd.factorize(list(zip(col, val)))
table1 = pd.DataFrame(f.reshape(df.shape, order='F'), df.index, df.columns)
pc, pc_ = pd.factorize(list(zip(f[:-n], f[n:])))
table2 = pd.DataFrame(
np.column_stack([pc_.tolist(), np.bincount(pc)]),
columns=list('pcn')
)
print(table1, table2, sep='\n\n')
a b c
0 0 2 6
1 0 2 6
2 0 3 7
3 0 3 8
4 1 4 9
5 1 4 9
6 1 5 10
p c n
0 0 2 2
1 0 3 2
2 1 4 2
3 1 5 1
4 2 6 2
5 3 7 1
6 3 8 1
7 4 9 2
8 5 10 1
Upvotes: 0
Reputation: 323376
IIUC using factorize
import pandas as pd
import numpy as np
s=df.stack().sort_index(level=[1])
s=pd.Series(pd.factorize(s.astype(str)+s.index.get_level_values(level=1))[0],index=s.index).unstack()
s
Out[665]:
a b c
0 0 2 6
1 0 2 6
2 0 3 7
3 0 3 8
4 1 4 9
5 1 4 9
6 1 5 10
yourdf=pd.DataFrame(np.append(s[['a','b']].values,s[['b','c']].values,axis=0),columns=['p','c']).groupby(['p','c']).size().reset_index()
yourdf
Out[681]:
p c 0
0 0 2 2
1 0 3 2
2 1 4 2
3 1 5 1
4 2 6 2
5 3 7 1
6 3 8 1
7 4 9 2
8 5 10 1
Upvotes: 1