Sumana

Reputation: 11

I want to remove duplicate word pairs within each row of a pandas DataFrame column:

Names
['abc aa','bdc sc','abc aa','bdc sp','bdc sc','pp sc','bdc sc',]
['lp aa','bd sc','bdc sc','bd sc','lp aa','bd sc']

['nn aa','bb sc','bb sc','nn aa','bd sc']

I tried this:

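def drop_dupli(text):
    seen = set()  # words already encountered
    result = []

    # split() breaks on whitespace, so single words are compared, not whole pairs
    for item in text.split():
        if item not in seen:
            seen.add(item)
            result.append(item)
    return " ".join(result)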
df['newame'] = df['Names'].apply(lambda x: drop_dupli(x))

The result came as follows:

Names
['abc aa','bdc sc','abc ','bdc sp','bdc ','pp sc','bdc ',]
['lp aa','bd sc','bdc sc','bd ','lp ','bd ']

['nn aa','bb sc','bb ','nn ','bd ']

But I want the result to be as follows:

Names
['abc aa','bdc sc','bdc sp','pp sc']
['lp aa','bd sc','bdc sc']

['nn aa','bb sc','bd sc']

Upvotes: 1

Views: 31

Answers (1)

jezrael

Reputation: 862771

Use the dict.fromkeys trick to remove duplicates while keeping the original order:

df['newame'] = df['Names'].apply(lambda x: list(dict.fromkeys(x)))
print (df)
                                               Names  \
0  [abc aa, bdc sc, abc aa, bdc sp, bdc sc, pp sc...   
1        [lp aa, bd sc, bdc sc, bd sc, lp aa, bd sc]   
2                [nn aa, bb sc, bb sc, nn aa, bd sc]   

                            newame  
0  [abc aa, bdc sc, bdc sp, pp sc]  
1           [lp aa, bd sc, bdc sc]  
2            [nn aa, bb sc, bd sc]  
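
This works because dict keys are unique and, since Python 3.7, a dict keeps its keys in insertion order, so the first occurrence of each item survives. Here is a minimal sketch of the same idea without pandas; `dedupe_keep_order` is a hypothetical helper name and the sample lists are copied from the question:

```python
def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict keys are unique and (since Python 3.7) keep insertion order
    return list(dict.fromkeys(items))


# Sample rows copied from the question
rows = [
    ['abc aa', 'bdc sc', 'abc aa', 'bdc sp', 'bdc sc', 'pp sc', 'bdc sc'],
    ['lp aa', 'bd sc', 'bdc sc', 'bd sc', 'lp aa', 'bd sc'],
    ['nn aa', 'bb sc', 'bb sc', 'nn aa', 'bd sc'],
]

deduped = [dedupe_keep_order(row) for row in rows]
for row in deduped:
    print(row)
```

Assuming each cell of the column holds a Python list of hashable items, the helper could be passed directly as df['Names'].apply(dedupe_keep_order).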

If you use a set instead, the original order is not preserved:

df['newame'] = df['Names'].apply(lambda x: list(set(x)))
print (df)
                                               Names  \
0  [abc aa, bdc sc, abc aa, bdc sp, bdc sc, pp sc...   
1        [lp aa, bd sc, bdc sc, bd sc, lp aa, bd sc]   
2                [nn aa, bb sc, bb sc, nn aa, bd sc]   

                            newame  
0  [pp sc, bdc sp, bdc sc, abc aa]  
1           [lp aa, bd sc, bdc sc]  
2            [bb sc, nn aa, bd sc]  
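
The loop from the question can also give the desired result if it iterates over the list items themselves instead of over the words from text.split(). Here is a rough sketch, assuming each cell is a list of strings; `drop_dupli_items` is a hypothetical name, not something from the question:

```python
def drop_dupli_items(items):
    """Return the items without repeats, in first-seen order."""
    seen = set()   # items already added to the result
    result = []
    for item in items:  # each item is a whole 'word pair' string
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Third sample row from the question
names = ['nn aa', 'bb sc', 'bb sc', 'nn aa', 'bd sc']
cleaned = drop_dupli_items(names)
print(cleaned)
```

This keeps the set only for membership checks and builds the output in a separate list, so the order of first appearance is retained.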

Upvotes: 1
