Reputation: 185
I have a DataFrame arranged as follows:
x y z
0 a jj Nan
1 b ii mm
2 c kk nn
3 d ii NaN
4 e Nan oo
5 f jj mm
6 g Nan nn
The desired output is:
x y z w
0 a jj Nan a
1 b ii mm a
2 c kk nn c
3 d ii NaN a
4 e Nan oo e
5 f jj mm a
6 g Nan nn c
The logic is:
take the union of columns y and z: ii and jj count as the same, since at index 1 and 5 they both have mm in column z
group by this union: index 0, 1, 3, 5 form one group, index 2, 6 form another
within each group, randomly take one cell in column x and assign it to column w for the whole group
I have no clue at all about this problem. Can somebody help me?
EDITNOTE:
I first posted a version with perfectly sorted columns y and z, as follows:
x y z w
0 a ii NaN a
1 b ii mm a
2 c jj mm a
3 d jj Nan a
4 e kk nn e
5 f Nan nn e
6 g Nan oo g
For this case, piRSquared's solution works perfectly.
EDITNOTE2:
Nickil Maveli's solution works perfectly for my problem. However, I noticed one situation that the solution cannot handle:
x y z
0 a ii mm
1 b ii nn
2 c jj nn
3 d jj oo
4 e kk oo
With Nickil Maveli's solution, the result would be as follows:
0 1 2 w
0 a ii mm a
1 b ii mm a
2 c jj nn c
3 d jj nn c
4 e kk oo e
However, the desired output should be w = ['a', 'a', 'a', 'a', 'a'].
Upvotes: 4
Views: 1151
Reputation: 353389
In the general case this is a set consolidation/connected components problem. While if we assume certain things about your data we can solve a reduced case, it's just a bit of bookkeeping to do the whole thing.
scipy has a connected components function we can use if we do some preparation:
import pandas as pd
import scipy.sparse
import scipy.sparse.csgraph

def via_cc(df_in):
    df = df_in.copy()

    # work with ranked version
    dfr = df[["y", "z"]].rank(method='dense')
    # give nans their own temporary rank
    dfr = dfr.fillna(dfr.max().fillna(0) + dfr.isnull().cumsum(axis=0))
    # don't let y and z get mixed up; have separate nodes per column
    dfr["z"] += dfr["y"].max()

    # build the adjacency matrix (one edge per row, from its y node to its z node)
    size = int(dfr.max().max()) + 1
    rows, cols = dfr["y"].astype(int), dfr["z"].astype(int)
    m = scipy.sparse.coo_matrix(([1] * len(dfr), (rows, cols)),
                                shape=(size, size))

    # do the work to find the groups
    _, cc = scipy.sparse.csgraph.connected_components(m)

    # get the group codes
    group = pd.Series(cc[dfr["y"].astype(int).values], index=dfr.index)
    # fill in w from x appropriately
    df["w"] = df["x"].groupby(group).transform("min")
    return df
which gives me
In [230]: via_cc(df0)
Out[230]:
x y z w
0 a jj NaN a
1 b ii mm a
2 c kk nn c
3 d ii NaN a
4 e NaN oo e
5 f jj mm a
6 g NaN nn c
In [231]: via_cc(df1)
Out[231]:
x y z w
0 a ii mm a
1 b ii nn a
2 c jj nn a
3 d jj oo a
4 e kk oo a
If you have a set consolidation recipe around, like the one here, you can simplify some of the above at the cost of an external function.
(Aside: note that in my df0, the "Nan"s are really NaNs. If you have a string "Nan" (note how it's different from NaN), then the code will think it's just another string and will assume that you want all "Nan"s to be in the same group.)
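If no recipe is at hand, the consolidation step can be sketched with a small union-find in plain Python. This is only an illustration, not the recipe linked above: the function name consolidate, the (x, y, z) tuple layout, and the use of None for missing values are all assumptions made for the example.

```python
def consolidate(rows):
    """Return one representative x per row; rows are (x, y, z) tuples.

    y or z may be None (assumed stand-in for a missing value).
    Rows sharing a y value or a z value, directly or through a chain
    of other rows, receive the same representative (the smallest x).
    """
    rows = list(rows)
    parent = {}

    def find(node):
        # register unseen nodes as their own root
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # tag nodes by kind so a y value never collides with a z value
    for i, (_, y, z) in enumerate(rows):
        row_node = ("row", i)
        find(row_node)
        if y is not None:
            union(row_node, ("y", y))
        if z is not None:
            union(row_node, ("z", z))

    # smallest x seen in each connected component
    best = {}
    for i, (x, _, _) in enumerate(rows):
        root = find(("row", i))
        if root not in best or x < best[root]:
            best[root] = x

    return [best[find(("row", i))] for i in range(len(rows))]


df0_rows = [("a", "jj", None), ("b", "ii", "mm"), ("c", "kk", "nn"),
            ("d", "ii", None), ("e", None, "oo"), ("f", "jj", "mm"),
            ("g", None, "nn")]
df1_rows = [("a", "ii", "mm"), ("b", "ii", "nn"), ("c", "jj", "nn"),
            ("d", "jj", "oo"), ("e", "kk", "oo")]

w0 = consolidate(df0_rows)  # ['a', 'a', 'c', 'a', 'e', 'a', 'c']
w1 = consolidate(df1_rows)  # ['a', 'a', 'a', 'a', 'a']
```

Because missing values never become nodes, each row with a missing y or z is linked only through the value it does have, which mirrors giving every NaN its own temporary rank in via_cc.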
Upvotes: 2
Reputation: 29719
First, replace all the 'Nan' strings with real NaN values. Next, group by 'y' and fill every 'z' in a group with the value at that group's first valid index in 'z'.
Then perform a groupby on 'z' and apply sum, which concatenates all the 'x' values in each group. Slice the result to pick the value that will fill the whole group (here, index 0).
Convert it to a dictionary to create the mapping, and finally assign it to a new column, 'w', as shown:
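import numpy as np
df_new = df.replace('Nan', np.nan)  # turn 'Nan' strings into real NaN
# rows whose 'y' is missing keep their own 'z'
df_new['z'] = df_new.groupby('y')['z'].transform(lambda x: x.loc[x.first_valid_index()]).fillna(df_new['z'])
df['w'] = df_new['z'].map(df_new.groupby('z')['x'].apply(lambda x: x.sum()[0]).to_dict())
df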
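To see why a single fill-and-map pass does not merge chained groups, here is a minimal sketch that runs the same idea on the five-row chain from the question's second edit. It swaps in transform("first") and first() for the lambdas above, and the names df_chain and z_filled are made up for the example.

```python
import pandas as pd

# chained input from the question: ii-nn-jj-oo-kk link every row together
df_chain = pd.DataFrame({
    "x": list("abcde"),
    "y": ["ii", "ii", "jj", "jj", "kk"],
    "z": ["mm", "nn", "nn", "oo", "oo"],
})

# step 1: within each y group, overwrite z with the group's first z
z_filled = df_chain.groupby("y")["z"].transform("first")

# step 2: map each filled z to the first x that carries it
first_x_per_z = df_chain["x"].groupby(z_filled).first()
df_chain["w"] = z_filled.map(first_x_per_z)

# z_filled is ['mm', 'mm', 'nn', 'nn', 'oo'], so w is ['a', 'a', 'c', 'c', 'e']
```

Overwriting z in step 1 discards the nn in row 1 and the oo in row 3, which are exactly the links that would have joined the three y groups, so the pass stops at three groups instead of one.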
Upvotes: 1
Reputation: 294498
This one is tricky!
I first check which elements share the same 'y' value as the row above them.
Then I check which share the same 'z' value as the row above them.
A new group starts wherever neither of these is true.
y_chk = df.y.eq(df.y.shift())
z_chk = df.z.eq(df.z.shift())
grps = (~y_chk & ~z_chk).cumsum()
# 'first' broadcasts each group's first x to every row of the group
df['w'] = df.groupby(grps).x.transform('first')
df
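As a worked example, the sketch below applies the neighbor checks to the sorted frame from the question's first edit and keeps the intermediate group labels. The names df_sorted, same_y and same_z are made up here. Note that the approach compares each row only with the one directly above it, so it relies on related rows being adjacent.

```python
import numpy as np
import pandas as pd

# the sorted example from the question, with real NaN values
df_sorted = pd.DataFrame({
    "x": list("abcdefg"),
    "y": ["ii", "ii", "jj", "jj", "kk", np.nan, np.nan],
    "z": [np.nan, "mm", "mm", np.nan, "nn", "nn", "oo"],
})

# True where a row has the same y (or z) as the row above it;
# NaN never compares equal, so missing values never link two rows
same_y = df_sorted["y"].eq(df_sorted["y"].shift())
same_z = df_sorted["z"].eq(df_sorted["z"].shift())

# a row that matches its predecessor on neither column starts a new group
grps = (~same_y & ~same_z).cumsum()

# give every row the first x of its group
df_sorted["w"] = df_sorted.groupby(grps)["x"].transform("first")

# grps is [1, 1, 1, 1, 2, 2, 3] and w is ['a', 'a', 'a', 'a', 'e', 'e', 'g']
```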
Upvotes: 2