Pandas DataFrame: Replacing column values with next column values to avoid duplication

Question

I have a Pandas DataFrame with hundreds of rows and 10 columns. Each row represents a unique ID, and each column holds the index of the k-th nearest neighbor. That is, the first column holds the index of the nearest neighbor to the ID, the second column holds the second-nearest neighbor, and so on down to the 10th-nearest neighbor.

However, the first column contains duplicates, since several IDs share a common nearest neighbor. I want to find each ID's nearest neighbor index subject to having no duplication. So if the first two IDs share the same nearest neighbor, I would use the second column to find the non-duplicated nearest neighbor for the second ID. For example, if my DataFrame looked like this:

      NN1    NN2    NN3    ...    NN10
1       1      3      8
2       1      5      9
3       1      5      2
4       3      8      1

Then the result would be:

In my data, from what I can tell, there is no case where duplicates remain even after falling back to the 10th nearest neighbor (and if there were, I could simply increase the number of nearest neighbors I use).

matthme · Accepted Answer

This might work, although it's surely not the most elegant way:

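import numpy as np
import pandas as pd

a = pd.DataFrame(....)

used_list = []  # neighbors already assigned to earlier rows

for i in range(a.shape[0]):
    # If this row's nearest neighbor is already taken, fall back to the
    # first neighbor in the row that has not been used yet.
    if np.isin(a.iloc[i, 0], used_list):
        take_column = ~np.isin(a.iloc[i], used_list)
        # np.argmax gives the position of the first True. If every neighbor
        # is already used it returns 0, so the duplicate is left in place.
        a.iloc[i, 0] = a.iloc[i, np.argmax(take_column)]

    used_list.append(a.iloc[i, 0])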
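
For reference, here is a self-contained sketch of the same greedy idea, run on the four example rows from the question (only NN1 to NN3, since the other columns are not shown). The function name `pick_unique_neighbors` and the output name `unique_NN` are made up for illustration. Unlike the loop above, this version returns a new Series instead of overwriting the first column, and it raises an error when a row has no unused neighbor left, which is my own assumption about how that case should be handled.

```python
import pandas as pd


def pick_unique_neighbors(nn: pd.DataFrame) -> pd.Series:
    """Greedily give each row its first neighbor not taken by an earlier row."""
    used = set()   # neighbors already assigned
    chosen = {}    # row ID -> assigned neighbor
    for row_id, neighbors in zip(nn.index, nn.to_numpy()):
        for candidate in neighbors:  # columns are ordered nearest first
            if candidate not in used:
                used.add(candidate)
                chosen[row_id] = candidate
                break
        else:
            # Every listed neighbor of this row is already taken (assumed policy: fail loudly).
            raise ValueError(f"all neighbors of ID {row_id} are already taken")
    return pd.Series(chosen, name="unique_NN")


# Example rows from the question (hypothetical reconstruction of the first three columns).
nn = pd.DataFrame(
    {"NN1": [1, 1, 1, 3], "NN2": [3, 5, 5, 8], "NN3": [8, 9, 2, 1]},
    index=[1, 2, 3, 4],
)
result = pick_unique_neighbors(nn)
print(result.tolist())  # [1, 5, 2, 3]
```

Note that the assignment is greedy and depends on row order: earlier IDs get first pick, so reordering the rows can change which ID keeps the shared neighbor.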
