user14140004
user14140004

Reputation: 25

Create Pandas column displaying the max number of repeated pairs from two other columns

I have two columns with lists of equal lengths, but the length of both lists may differ per row. I want to create col 3 which is the number corresponding to the max number of duplicates created by pairing the numbers by index between col 1 and col 2. For example:

df=

col 1                       col 2               col 3
["c","c"]                   ["d", "d"]          2 // ("c","d") is repeated twice
["a","b","c","a"]           ["f","e","e","f"]   2 //("a","f") is repeated twice while ("b","e") and ("c","e") are only repeated once
["a","b","g"]               ["f","e","f"]       1 //("a","f"), ("b","e"), and ("g","f") are repeated once

What I've tried so far:

My approach thus far has been to first create a new column with a list of tuples. Taking the first row in my example, I wanted to create [("c","d"), ("c","d")] and repeat this for every column. Then I planned to apply Counter(df["col 3"]).most_common(1)[0][1] to get the max number of duplicated pairs.

To create the column with lists of tuples I tried: df["col 3"] = list(zip(df["col 1"],df["col 2"])) but this seems to return (["c","c"],["d","d"]) using the first row as an example instead of [("c","d"),("c","d")]

Any help would be much appreciated!

Upvotes: 0

Views: 95

Answers (2)

sammywemmy
sammywemmy

Reputation: 28644

You could also try with a list comprehension and numpy :

df['col3'] = [np.max(np.unique(tuple(zip(*entry)), 
                     axis=0, 
                     return_counts=True)[-1])
             for entry in zip(df.col1, df.col2)
              ]

Upvotes: 0

Space Impact
Space Impact

Reputation: 13255

Try apply with Counter:

df['col 3'] = df.apply(lambda x: np.max(Counter(zip(x['a'], x['b'])).values()), axis=1)

Upvotes: 1

Related Questions