Reputation: 15
I have a data set in which I am comparing each value of column1 to all values of column2. I am able to create a binary variable for each row noting whether the column1 value is found anywhere in column2.
I would now like to create a column that lists all index positions where the column1 value is found in column2. Working in Python 3.6.
import pandas as pd
import numpy as np
data = [{'column1': 'ibm', 'column2': 'apple'},
        {'column1': 'microsoft', 'column2': 'ibm'},
        {'column1': 'apple', 'column2': 'ibm'},
        {'column1': 'apple', 'column2': 'microsoft'},
        {'column1': 'yahoo', 'column2': 'microsoft'}]
data_df = pd.DataFrame(data)
data_df['match'] = np.where((data_df.column1.isin(data_df['column2'])), 1, 0)
This part works correctly:
     column1    column2  match
0        ibm      apple      1
1  microsoft        ibm      1
2      apple        ibm      1
3      apple  microsoft      1
4      yahoo  microsoft      0
To create the index position list for each value in column1 found in column2 I have tried this:
data_df['indices'] = [i for i, x in enumerate(data_df['column2']) if x == np.where((data_df.column1.isin(data_df['column2'])))]
However, I get the following error:
data_df['indices'] = [i for i, x in enumerate(data_df['column2']) if x == np.where((data_df.column1.isin(data_df['column2'])))]
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3119, in __setitem__
self._set_item(key, value)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3194, in _set_item
value = self._sanitize_column(key, value)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/frame.py", line 3391, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "/home/carterrees/PycharmProjects/data_services_predictopotamus/venv_predictopotamus36/lib64/python3.6/site-packages/pandas/core/series.py", line 4001, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
What I am hoping to see is this:
     column1    column2  match indices
0        ibm      apple      1     1,2
1  microsoft        ibm      1     3,4
2      apple        ibm      1       0
3      apple  microsoft      1       0
4      yahoo  microsoft      0     NaN
Upvotes: 1
Views: 87
Reputation: 51175
factorize + stack + np.flatnonzero:
import numpy as np
import pandas as pd

# df refers to the question's data_df.
f, l = pd.factorize(df.stack())   # encode every string in both columns as an integer code
r = f.reshape(df.shape)           # column 0 holds column1 codes, column 1 holds column2 codes
m = r[:, 0, None] == r[:, 1]      # m[i, j] is True when column1[i] equals column2[j]
df.assign(
    indices=[np.flatnonzero(c) for c in m],   # matching positions in column2 for each row
    match=m.sum(1).astype(bool)               # True if there is at least one match
)
     column1    column2 indices  match
0        ibm      apple  [1, 2]   True
1  microsoft        ibm  [3, 4]   True
2      apple        ibm     [0]   True
3      apple  microsoft     [0]   True
4      yahoo  microsoft      []  False
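If the integer encoding feels indirect, the same boolean matrix can be built by broadcast-comparing the raw strings. This is a minimal sketch assuming the question's data_df; m[i, j] is True exactly when column1[i] equals column2[j], so the nonzero positions of each row are the wanted indices:
import numpy as np
import pandas as pd

data_df = pd.DataFrame({
    'column1': ['ibm', 'microsoft', 'apple', 'apple', 'yahoo'],
    'column2': ['apple', 'ibm', 'ibm', 'microsoft', 'microsoft'],
})

# (5, 5) matrix: compare every column1 value against every column2 value.
m = data_df['column1'].values[:, None] == data_df['column2'].values

out = data_df.assign(
    indices=[np.flatnonzero(row) for row in m],   # positions in column2 that match
    match=m.any(axis=1),                          # True if there is at least one match
)
The factorize version is mainly a speed optimization: for larger frames, comparing small integer codes tends to be cheaper than comparing strings element by element.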
Upvotes: 1
Reputation: 403030
You can efficiently construct the "indices" column by first building a dictionary that maps each company in column2 to the positions where it occurs, then querying that dictionary with a single linear scan of column1.
After that, you can derive the "match" column from "indices".
from collections import defaultdict
import numpy as np

# df refers to the question's data_df.
d = defaultdict(list)
for i, company in enumerate(df['column2']):
    d[company].append(str(i))   # record each position where the company appears in column2

d
# defaultdict(list, {'apple': ['0'], 'ibm': ['1', '2'], 'microsoft': ['3', '4']})

# Now comes the fun part.
idx_mapping = {k: ','.join(v) for k, v in d.items()}                  # join positions into "1,2" strings
df['indices'] = [idx_mapping.get(x, np.nan) for x in df['column1']]   # NaN when the value never appears
df['match'] = df['indices'].notna()
df
     column1    column2  match indices
0        ibm      apple   True     1,2
1  microsoft        ibm   True     3,4
2      apple        ibm   True       0
3      apple  microsoft   True       0
4      yahoo  microsoft  False     NaN
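If you would rather have the indices as actual integer lists (as in the other answer) instead of comma-joined strings, a small variation of the same dictionary idea works. Again a sketch assuming the question's df:
from collections import defaultdict
import numpy as np

d = defaultdict(list)
for i, company in enumerate(df['column2']):
    d[company].append(i)   # keep integer positions instead of strings

# dict.get() does not trigger the defaultdict factory, so misses stay NaN.
df['indices'] = [d.get(x, np.nan) for x in df['column1']]
df['match'] = [x in d for x in df['column1']]
Storing Python lists in a column is convenient for inspection, but it keeps the column as object dtype, so avoid it if you need vectorized operations on the indices afterwards.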
Upvotes: 1