Aureliano Guedes
Aureliano Guedes

Reputation: 785

Python + pandas replace column values based in other column with

My question is simple, let's start with an hypothetical table

| a | b | c |  
| 0 | 0 | 0 |  
| 0 | 0 | 0 |  
| 0 | 1 | 1 |  
| 0 | 1 | 2 |  
| 1 | 2 | 3 |  
| 1 | 2 | 3 |  
| 1 | 3 | 4 |  

This table should be modified, the numbers in column b don't be the same existing in column a, and column c don't be the same number in column b.
So the number should be incremented continually.
The final should looks like this:

| a | b | c  |  
| 0 | 2 | 6  |  
| 0 | 2 | 6  |  
| 0 | 3 | 7  |  
| 0 | 3 | 8  |  
| 1 | 4 | 9  |  
| 1 | 4 | 9  |  
| 1 | 5 | 10 |  

Finally, has more one table. Based in the last table, since c is child of b and b child of a, this table should be converted in two columns table with 3rd column with the height of each row:

| parent | child | size |  
| 0      | 2     | 2    |  
| 0      | 3     | 2    |  
| 1      | 4     | 2    |  
| 1      | 5     | 1    |  
| 2      | 6     | 2    |  
| 3      | 7     | 1    |  
| 3      | 8     | 1    |  
| 4      | 9     | 2    |  
| 5      | 10    | 1    |  

In this 3rd table, all rows are unique and map all tree using only two columns. The 3rd column is the height of each row, that's means, how many times each pair appears in second table.

Upvotes: 1

Views: 198

Answers (2)

piRSquared
piRSquared

Reputation: 294516

My go at both

n = len(df)
col = df.columns.repeat(n)
val = df.values.ravel(order='F')
f, u = pd.factorize(list(zip(col, val)))
table1 = pd.DataFrame(f.reshape(df.shape, order='F'), df.index, df.columns)

pc, pc_ = pd.factorize(list(zip(f[:-n], f[n:])))
table2 = pd.DataFrame(
    np.column_stack([pc_.tolist(), np.bincount(pc)]),
    columns=list('pcn')
)

print(table1, table2, sep='\n\n')


   a  b   c
0  0  2   6
1  0  2   6
2  0  3   7
3  0  3   8
4  1  4   9
5  1  4   9
6  1  5  10

   p   c  n
0  0   2  2
1  0   3  2
2  1   4  2
3  1   5  1
4  2   6  2
5  3   7  1
6  3   8  1
7  4   9  2
8  5  10  1

Upvotes: 0

BENY
BENY

Reputation: 323376

IIUC using factorize

import pandas as pd 
import numpy as np

s=df.stack().sort_index(level=[1])
s=pd.Series(pd.factorize(s.astype(str)+s.index.get_level_values(level=1))[0],index=s.index).unstack()
s
Out[665]:
   a  b   c
0  0  2   6
1  0  2   6
2  0  3   7
3  0  3   8
4  1  4   9
5  1  4   9
6  1  5  10


yourdf=pd.DataFrame(np.append(s[['a','b']].values,s[['b','c']].values,axis=0),columns=['p','c']).groupby(['p','c']).size().reset_index()
yourdf
Out[681]: 
   p   c  0
0  0   2  2
1  0   3  2
2  1   4  2
3  1   5  1
4  2   6  2
5  3   7  1
6  3   8  1
7  4   9  2
8  5  10  1

Upvotes: 1

Related Questions