8ull53y3
8ull53y3

Reputation: 47

Doing a vlookup type chain in a python dataframe, which labels the number of iterations complete and publishes results in another dataframe

I'm looking to attempt a vlookup chain in a dataframe and the results be published in a new dataframe.

The actual dataframe as many columns but for this task i'm only interested in 2. The data or element placing within the dataframe is not random and is all linked together, so values in column a related to column b. My dummy data down below may not show this correctly but maybe my output may make sense.

data = {'a': [111, 112, 113, 114, 115, 215, 214, 213, 212, 211],
        'b': [112, 113, 114, 115, 116, 214, 213, 212, 211, 210]}

with the above ouput I want to first check if for example each element in column a that exists in column b is my initial search start.

If for example we use 115 as an example ( I know its not the first element, but regardless of first element I would like this result repeated for each element)

115 exists in both columns, search for it in column B, then once found, look at the value which is the same index as 115 in column b in column a which is 114, now search for 114 in column b and once found search the index of the found114 in column a which is 113 and complete until no more are found. This is the first complete iteration loop then the 2nd for example will be the same but for 215.

New df will look like the below:

new_df =  {'found': [112, 113, 114, 115, 214, 213, 212, 211],
        'iteration': [1, 1, 1, 1, 2, 2, 2, 2]}

and if possible omit semi loops so if 114 is a new starting point it will result in, 114, 113 and 112 but as it's a subset of the bigger chain from 115, I would like to omit it.

sample code below:

matches = df[df['a'].isin(df['b'])]

current_value = [df.loc[matches.index, 'a']][0]


result_dict = {'a': [], 'Chain':[]}

iteration = 1

for iteration, start_value in enumerate(starting_values, start=1):
    current_value = start_value
    visited_values = set()
    #iteration += 1
    values_in_loop = []

    while True:
        result.append(current_value)
        #iteration_results.add(iteration, start_value)
        if current_value in df['b'].values and current_value not in visited_values:
            current_index = df.index[df['b'] == current_value][0]
            current_value = df['a'].iloc[current_index]
            visited_values.add(current_value)
            values_in_loop.append(current_value)

        else:
            break
        
    iteration += 1   
    
new_df = pd.DataFrame({'found': result[::], 'Chain': iteration})

new_df

Another example I've found on a similar question on stackoverflow linked here: construct a chain using two columns in python

   b      a
Type1   Type2
Type3   Type4
Type8   Type13
Type3   Type15
Type2   Type6
Type4   Type9
Type6   Type11
Type9   Type18
Type13  Type20

out put should result in as below, that question answer is using the similar approach but applying it in a different way but a different output to what I require. Hope this helps

found  iteration
Type2     1
Type6     1
Type11    1
Type 4    2
Type 9    2
Type 18   2
Type 13   3
Type 20   3

Upvotes: 0

Views: 44

Answers (1)

mozway
mozway

Reputation: 260390

This is a graph problem, use network with weakly_connected_components, and optionally pandas.factorize:

# pip install networkx
import networkx as nx

G = nx.from_pandas_edgelist(df, source='b', target='a', create_using=nx.DiGraph)

groups = {n: i for i, g in enumerate(nx.weakly_connected_components(G), start=1)
          for n in g}

m = df['a'].isin(df['b'])

out = pd.DataFrame({'found': df.loc[m, 'a'],
                    'iteration': df.loc[m, 'a'].map(groups)
                    })
# optional
# if you must ensure that the numbers are 1 -> n without missing ones
out['iteration'] = pd.factorize(out['iteration'])[0]+1

Output:

   found  iteration
1    112          1
2    113          1
3    114          1
4    115          1
6    214          2
7    213          2
8    212          2
9    211          2

Graph:

enter image description here

Example dataset to demonstrate the effect of factorize:

data = {'a': [111, 112, 113, 114, 300, 115, 215, 214, 213, 212, 211],
        'b': [112, 113, 114, 115, 301, 116, 214, 213, 212, 211, 210]}
df = pd.DataFrame(data)

Output:

    found  iteration  iteration_factorize
1     112          1                    1
2     113          1                    1
3     114          1                    1
5     115          1                    1
7     214          3                    2
8     213          3                    2
9     212          3                    2
10    211          3                    2

Upvotes: 0

Related Questions