Reputation: 25
I have two columns with lists of equal lengths, but the length of both lists may differ per row. I want to create col 3 which is the number corresponding to the max number of duplicates created by pairing the numbers by index between col 1 and col 2. For example:
df=
col 1 col 2 col 3
["c","c"] ["d", "d"] 2 // ("c","d") is repeated twice
["a","b","c","a"] ["f","e","e","f"] 2 //("a","f") is repeated twice while ("b","e") and ("c","e") are only repeated once
["a","b","g"] ["f","e","f"] 1 //("a","f"), ("b","e"), and ("g","f") are repeated once
What I've tried so far:
My approach thus far has been to first create a new column with a list of tuples. Taking the first row in my example, I wanted to create [("c","d"), ("c","d")] and repeat this for every column. Then I planned to apply Counter(df["col 3"]).most_common(1)[0][1]
to get the max number of duplicated pairs.
To create the column with lists of tuples I tried:
df["col 3"] = list(zip(df["col 1"],df["col 2"]))
but this seems to return (["c","c"],["d","d"]) using the first row as an example instead of [("c","d"),("c","d")]
Any help would be much appreciated!
Upvotes: 0
Views: 95
Reputation: 28644
You could also try with a list comprehension and numpy :
df['col3'] = [np.max(np.unique(tuple(zip(*entry)),
axis=0,
return_counts=True)[-1])
for entry in zip(df.col1, df.col2)
]
Upvotes: 0
Reputation: 13255
Try apply
with Counter
:
df['col 3'] = df.apply(lambda x: np.max(Counter(zip(x['a'], x['b'])).values()), axis=1)
Upvotes: 1