The Great

Reputation: 7723

Pandas filter list of list values in a dataframe column

I have a dataframe like the one below:

import pandas as pd

sample_df = pd.DataFrame({'single_proj_name': [['jsfk'],['fhjk'],['ERRW'],['SJBAK']],
                          'single_item_list': [['ABC_123'],['DEF123'],['FAS324'],['HSJD123']],
                          'single_id':[[1234],[5678],[91011],[121314]],
                          'multi_proj_name':[['AAA','VVVV','SASD'],['QEWWQ','SFA','JKKK','fhjk'],['ERRW','TTTT'],['SJBAK','YYYY']],
                          'multi_item_list':[[['XYZAV','ADS23','ABC_123'],['ABC_123','ADC_123']],['XYZAV','DEF123','ABC_123','SAJKF'],['QWER12','FAS324'],['JFAJKA','HSJD123']],
                          'multi_id':[[[2167,2147,29481],[5432,1234]],[2313,57567,2321,7898],[1123,8775],[5237,43512]]})

I would like to do the following:

a) Pick the value from single_item_list for each row.

b) Search for that value in the multi_item_list column of the same row. Please note that for some rows this column holds a list of lists.

c) If a match is found, keep only the matched values in multi_item_list and remove all non-matching values.

d) Based on the position of the matched item, look up the corresponding value in the multi_id list and keep only that item. Remove the items at all other positions.

I tried the following, but it doesn't work for nested lists:

for a, b, c in zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id']):
    for i, x in enumerate(b):
        print(x)
        print(a[0])
        if a[0] in x:
            print(x.index(a[0]))
            pos = x.index(a[0])
            print(c[pos-1])

I expect my output to look like the image below. In my real data, there will be more cases like the first input row (nested lists with multiple levels).

[Image: expected output]

Upvotes: 1

Views: 221

Answers (2)

Shubham Sharma

Reputation: 71689

Here is one approach which works with any number of nested lists:

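def func(z, X, Y):
    # z: value to search for; X: (possibly nested) item list; Y: ids parallel to X
    A, B = [], []
    for x, y in zip(X, Y):
        if isinstance(x, list):
            # Nested list: recurse and keep the filtered sub-lists
            a, b = func(z, x, y)
            A.append(a)
            B.append(b)
        elif x == z:
            # Match: keep the item and the id at the same position
            A.append(x)
            B.append(y)
    return A, B


# df is the question's sample_df
c = ['single_item_list', 'multi_item_list', 'multi_id']
df[c[1:]] = [func(z, X, Y) for [z], X, Y in df[c].to_numpy()]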

Result

  single_proj_name single_item_list single_id           multi_proj_name         multi_item_list           multi_id
0           [jsfk]        [ABC_123]    [1234]         [AAA, VVVV, SASD]  [[ABC_123], [ABC_123]]  [[29481], [5432]]
1           [fhjk]         [DEF123]    [5678]  [QEWWQ, SFA, JKKK, fhjk]                [DEF123]            [57567]
2           [ERRW]         [FAS324]   [91011]              [ERRW, TTTT]                [FAS324]             [8775]
3          [SJBAK]        [HSJD123]  [121314]             [SJBAK, YYYY]               [HSJD123]            [43512]
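
To see what the recursion does without pandas in the way, here is a minimal standalone sketch of the same idea on plain lists. The name filter_matches and its parameter names are my own for illustration and are not part of the code above; the inputs are rows 0 and 1 of the question's data.

```python
def filter_matches(target, items, ids):
    """Keep entries of `items` equal to `target`, plus the ids at the same positions."""
    kept_items, kept_ids = [], []
    for item, id_ in zip(items, ids):
        if isinstance(item, list):
            # Nested level: recurse and keep each filtered sub-list as one entry.
            sub_items, sub_ids = filter_matches(target, item, id_)
            kept_items.append(sub_items)
            kept_ids.append(sub_ids)
        elif item == target:
            # Exact match: keep the item and its positionally paired id.
            kept_items.append(item)
            kept_ids.append(id_)
    return kept_items, kept_ids


# Row 0 of the question's data (one level of nesting).
nested_result = filter_matches(
    'ABC_123',
    [['XYZAV', 'ADS23', 'ABC_123'], ['ABC_123', 'ADC_123']],
    [[2167, 2147, 29481], [5432, 1234]],
)
print(nested_result)  # ([['ABC_123'], ['ABC_123']], [[29481], [5432]])

# Row 1 of the question's data (flat list).
flat_result = filter_matches(
    'DEF123',
    ['XYZAV', 'DEF123', 'ABC_123', 'SAJKF'],
    [2313, 57567, 2321, 7898],
)
print(flat_result)  # (['DEF123'], [57567])

# Made-up input: a sub-list without a match is kept as an empty list.
no_match = filter_matches('ZZZ', [['A'], ['ZZZ']], [[1], [2]])
print(no_match)  # ([[], ['ZZZ']], [[], [2]])
```

Note the last case: a sub-list that contains no match is not dropped, it stays in the result as an empty list, so the nesting structure of the row is preserved.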

Upvotes: 1

The Great

Reputation: 7723

I used isinstance to check whether a row holds a nested list and came up with the code below, which produces the expected output. I am open to suggestions and improvements from the experts here.

import numpy as np

for i, (single, multi_item, multi_id) in enumerate(zip(sample_df['single_item_list'],sample_df['multi_item_list'],sample_df['multi_id'])):
    if not any(isinstance(entry, list) for entry in multi_item):
        # Flat list: keep only the matching item and the id at the same position
        for j, item in enumerate(multi_item):
            if item == single[0]:
                sample_df.at[i,'multi_item_list'] = [item]
                sample_df.at[i,'multi_id'] = [multi_id[j]]
    else:
        # One level of nesting: replace each sub-list by its match, or NaN if it has none
        for j, sub_items in enumerate(multi_item):
            if single[0] in sub_items:
                pos = sub_items.index(single[0])
                sample_df.at[i,'multi_item_list'][j] = single[0]
                sample_df.at[i,'multi_id'][j] = multi_id[j][pos]
            else:
                sample_df.at[i,'multi_item_list'][j] = np.nan
                sample_df.at[i,'multi_id'][j] = np.nan
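
The same two-branch logic can also be sketched as a function that returns new lists instead of modifying the dataframe cells in place, which makes it easier to try on a single row. This is only a rough sketch under a few assumptions: filter_one_level is a hypothetical name, math.nan stands in for np.nan so the snippet needs no third-party imports, the flat branch keeps the first exact match, and the nested branch expects every entry to be a list with exactly one level of nesting.

```python
import math


def filter_one_level(target, items, ids):
    """Hypothetical helper: handles flat lists and exactly one level of nesting."""
    if not any(isinstance(entry, list) for entry in items):
        # Flat row: keep the first exact match and the id at the same position.
        for pos, entry in enumerate(items):
            if entry == target:
                return [entry], [ids[pos]]
        return items, ids  # no match: leave the row unchanged
    new_items, new_ids = [], []
    for sub_items, sub_ids in zip(items, ids):
        if target in sub_items:
            # Sub-list with a match: collapse it to the match and its paired id.
            pos = sub_items.index(target)
            new_items.append(target)
            new_ids.append(sub_ids[pos])
        else:
            # Sub-list without a match: NaN placeholder keeps positions aligned.
            new_items.append(math.nan)
            new_ids.append(math.nan)
    return new_items, new_ids


# Row 0 of the question's data (nested).
nested = filter_one_level(
    'ABC_123',
    [['XYZAV', 'ADS23', 'ABC_123'], ['ABC_123', 'ADC_123']],
    [[2167, 2147, 29481], [5432, 1234]],
)
print(nested)  # (['ABC_123', 'ABC_123'], [29481, 5432])

# Row 2 of the question's data (flat).
flat = filter_one_level('FAS324', ['QWER12', 'FAS324'], [1123, 8775])
print(flat)  # (['FAS324'], [8775])

# Made-up input: the first sub-list has no match and becomes NaN.
missing = filter_one_level('ABC_123', [['XYZAV'], ['ABC_123']], [[1], [2]])
print(missing)
```

Unlike the recursive answer, this flattens each matching sub-list to a single value, so the nested row comes back as ['ABC_123', 'ABC_123'] rather than [['ABC_123'], ['ABC_123']], and it would not handle deeper nesting.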

Upvotes: 0
