Philip

Reputation: 185

Take union of two columns, Python + Pandas

I have a df arranged as follows:

   x    y    z
0  a   jj  Nan
1  b   ii   mm
2  c   kk   nn
3  d   ii  NaN
4  e  Nan   oo
5  f   jj   mm
6  g  Nan   nn

The desired output is:

   x    y    z   w
0  a   jj  Nan   a
1  b   ii   mm   a
2  c   kk   nn   c
3  d   ii  NaN   a
4  e  Nan   oo   e
5  f   jj   mm   a
6  g  Nan   nn   c

The logic is:

  1. take the union of columns y and z: ii == jj, since rows 1 and 5 both have mm in column z

  2. group by this union: rows 0, 1, 3 and 5 form one group, rows 2 and 6 form another

  3. within each group, arbitrarily take one cell from column x and assign it to column w for the whole group

I have no clue at all about this problem. Can somebody help me?

EDITNOTE:

I originally posted a version where columns y and z were perfectly sorted, as follows:

   x    y    z   w
0  a   ii  NaN   a
1  b   ii   mm   a
2  c   jj   mm   a
3  d   jj  Nan   a
4  e   kk   nn   e
5  f  Nan   nn   e
6  g  Nan   oo   g

For this case, piRSquared's solution works perfectly.

EDITNOTE2:

Nickil Maveli's solution works perfectly for my problem. However, I noticed a situation that the solution cannot handle:

   x   y   z
0  a  ii  mm
1  b  ii  nn
2  c  jj  nn
3  d  jj  oo
4  e  kk  oo

With Nickil Maveli's solution, the result is as follows:

   x   y   z  w
0  a  ii  mm  a
1  b  ii  mm  a
2  c  jj  nn  c
3  d  jj  nn  c
4  e  kk  oo  e

However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'], because the overlaps chain together: ii links rows 0 and 1, nn links rows 1 and 2, jj links rows 2 and 3, and oo links rows 3 and 4, so all five rows belong to one group.

Upvotes: 4

Views: 1151

Answers (3)

DSM

Reputation: 353389

In the general case this is a set consolidation/connected components problem. While we could assume certain things about your data and solve a reduced case, it's only a bit of bookkeeping to do the whole thing.

scipy has a connected components function we can use if we do some preparation:

import pandas as pd
import scipy.sparse
import scipy.sparse.csgraph

def via_cc(df_in):
    df = df_in.copy()

    # work with ranked version
    dfr = df[["y","z"]].rank(method='dense')
    # give nans their own temporary rank
    dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
    # don't let y and z get mixed up; have separate nodes per column
    dfr["z"] += dfr["y"].max() 

    # build the adjacency matrix
    size = int(dfr.max().max()) + 1
    m = scipy.sparse.coo_matrix(([1]*len(dfr), (dfr.y, dfr.z)),
                                (size, size))

    # do the work to find the groups
    _, cc = scipy.sparse.csgraph.connected_components(m)

    # get the group codes
    group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
    # fill in w from x appropriately
    df["w"] = df["x"].groupby(group).transform(min)

    return df

which gives me

In [230]: via_cc(df0)
Out[230]: 
   x    y    z  w
0  a   jj  NaN  a
1  b   ii   mm  a
2  c   kk   nn  c
3  d   ii  NaN  a
4  e  NaN   oo  e
5  f   jj   mm  a
6  g  NaN   nn  c

In [231]: via_cc(df1)
Out[231]: 
   x   y   z  w
0  a  ii  mm  a
1  b  ii  nn  a
2  c  jj  nn  a
3  d  jj  oo  a
4  e  kk  oo  a

If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.
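
For reference, here is a minimal pure-Python consolidation sketch (my own illustrative helper, not the linked recipe): each row contributes the set {y, z}, and any sets that overlap are merged transitively:

def consolidate(sets):
    # fold each incoming set into every existing group it overlaps,
    # so the result is the transitive closure of the overlaps
    result = []
    for s in map(set, sets):
        overlapping = [g for g in result if g & s]
        for g in overlapping:
            result.remove(g)
            s |= g
        result.append(s)
    return result

# the chained example from EDITNOTE2 collapses into a single group
print(consolidate([{'ii', 'mm'}, {'ii', 'nn'}, {'jj', 'nn'},
                   {'jj', 'oo'}, {'kk', 'oo'}]))
# one group containing ii, jj, kk, mm, nn, oo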

(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a string "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)
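
If your frame does contain the literal string "Nan", a quick way to normalize it first (a small sketch, assuming the frame is called df) is:

import numpy as np

# turn the string "Nan" into a real missing value before grouping
df = df.replace('Nan', np.nan)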

Upvotes: 2

Nickil Maveli

Reputation: 29719

Replace all 'Nan' strings so they become real NaN values. Next, group by 'y' and fill each group's missing values in 'z' with the value at the group's first valid index in 'z'.

Then, perform a groupby on 'z', aggregating all the 'x' values in each group with sum and slicing to fill every value in that group with that particular value (here, slice=0).

Convert it to a dictionary to create the mapping, and finally assign it back to a new column 'w', as shown:

import numpy as np

df_new = df.replace('Nan', np.NaN)  # turn the string 'Nan' into real NaN
# within each y group, fill z from the group's first valid z value
df_new['z'] = df_new.groupby('y')['z'].transform(lambda x: x.loc[x.first_valid_index()])
# map each z group to the first x of that group (sum concatenates the strings)
df['w'] = df_new['z'].map(df_new.groupby('z')['x'].apply(lambda x: x.sum()[0]).to_dict())
df

[output screenshot: the resulting df with the new w column]

Upvotes: 1

piRSquared

Reputation: 294498

This one is tricky!

I first evaluate which elements share the same 'y' value as their neighbor.
Then I check who has the same 'z' as their neighbor.
A new group starts when neither of these is true.

import pandas as pd

# True where a row shares its y (or z) value with the row directly above
y_chk = df.y.eq(df.y.shift())
z_chk = df.z.eq(df.z.shift())
# a new group starts whenever neither column matches the previous row
grps = (~y_chk & ~z_chk).cumsum()
# broadcast the first x of each group into w
df['w'] = df.groupby(grps).x.transform(pd.Series.head, n=1)
df

[output screenshot: the resulting df with the new w column]
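
Note that transform(pd.Series.head, n=1) broadcasts the first row of each group into every row of that group. Since column x has no missing values here, the built-in 'first' aggregation should give the same result (an equivalent I'm suggesting, not part of the original answer):

# 'first' skips NaN, but x contains none, so this matches head(n=1)
df['w'] = df.groupby(grps).x.transform('first')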

Upvotes: 2
