Reputation: 795
I have a dataframe named df like this: (there's no duplicate rows of df)
a_id b_id
111111 18
111111 17
222222 18
333333 14
444444 13
555555 18
555555 24
222222 13
222222 17
333333 17
And I want to invert it to a dataframe df_2 like this:
a_one a_two b_list number_of_b
222222 444444 13 1
111111 222222 17,18 2
111111 333333 17 1
111111 222222 17 1
222222 333333 17 1
111111 555555 18 1
222222 555555 18 1
If a_id share the same b_id, they become a pair on df_2;
the b_list of df_2 is the correspondingly b_id;
the number_of_b is the length of b_list
Upvotes: 1
Views: 102
Reputation: 19947
Use a chained operation with a series of groupby and transformations:
from itertools import combinations
df2 = (
df.groupby('b_id')['a_id']
.apply(lambda x: list(combinations(x.values,2)))
.apply(pd.Series).stack()
.reset_index(0)
.groupby(0).apply(lambda x: [len(x), x.b_id.astype(str).tolist()])
.apply(pd.Series)
.rename(columns={0:'b_list', 1:'number_of_b'})
.pipe(lambda x: x.reset_index(drop=True)
.join(x.reset_index()[0].apply(pd.Series)))
.rename(columns={0:'a_one', 1:'a_two'})
.assign(number_of_b=lambda x: x.number_of_b.str.join(','))
)[['a_one','a_two','b_list','number_of_b']]
df2
Out[123]:
a_one a_two b_list number_of_b
0 111111 222222 2 17,18
1 111111 333333 1 17
2 111111 555555 1 18
3 222222 333333 1 17
4 222222 555555 1 18
5 444444 222222 1 13
Upvotes: 1
Reputation: 609
I have a solution:
First, make the combinations of a_id
that have the same b_id
:
from itertools import combinations
df = df.groupby("b_id").apply(lambda x: list(combinations(x["a_id"], 2))).apply(pd.Series).stack()
df
now is:
b_id
13 0 (444444, 222222)
17 0 (111111, 222222)
1 (111111, 333333)
2 (222222, 333333)
18 0 (111111, 222222)
1 (111111, 555555)
2 (222222, 555555)
Then split the Series, reset the index and concat the appearance of b_id
:
df = df.apply(pd.Series).reset_index().groupby([0,1])["b_id"].apply(lambda x:x.values).reset_index()
Now we get:
0 1 b_id
0 111111 222222 [17, 18]
1 111111 333333 [17]
2 111111 555555 [18]
3 222222 333333 [17]
4 222222 555555 [18]
5 444444 222222 [13]
This is almost what you need. And for the exact results:
df.columns = ["a_one", "a_two", "b_list"]
df["number_of_b"] = df.b_list.apply(len)
The final results:
a_one a_two b_list number_of_b
0 111111 222222 [17, 18] 2
1 111111 333333 [17] 1
2 111111 555555 [18] 1
3 222222 333333 [17] 1
4 222222 555555 [18] 1
5 444444 222222 [13] 1
The whole code for clarity:
from itertools import combinations
df = df.groupby("b_id").apply(lambda x: list(combinations(x["a_id"], 2))).apply(pd.Series).stack()
df = df.apply(pd.Series).reset_index().groupby([0,1])["b_id"].apply(lambda x:x.values).reset_index()
df.columns = ["a_one", "a_two", "b_list"]
df["number_of_b"] = df.b_list.apply(len)
This is not that fancy. Look forward to better solutions!
Upvotes: 1