Reputation: 45
I have the following dataframe:
df = pd.DataFrame({'TX':['bob','tim','frank'],'IL':['fred','bob','tim'],'NE':['tim','joe','bob']})
I would like to isolate the strings that occur across all columns to generate a list. The expected result is:
output = ['tim','bob']
The only way I can think to achieve this is using for loops which I would like to avoid. Is there a built-in pandas function suited to accomplishing this?
Upvotes: 3
Views: 3570
Reputation: 862671
You can create a mask by counting the values per column, testing which counts are not missing, and then keeping the rows that are True across all columns with DataFrame.all:
m = df.apply(pd.value_counts).notna()
print (m)
TX IL NE
bob True True True
frank True False False
fred False True False
joe False False True
tim True True True
L = m.index[m.all(axis=1)].tolist()
print (L)
['bob', 'tim']
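Note that newer pandas releases deprecate the top-level pd.value_counts function, so (assuming a recent pandas version) the same mask can be built with Series.value_counts inside the apply:

```python
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Count each name per column; names absent from a column come out as NaN,
# so notna() marks where a name occurs at least once in that column
m = df.apply(lambda col: col.value_counts()).notna()

# Keep only the names present in every column
L = m.index[m.all(axis=1)].tolist()
print(sorted(L))  # ['bob', 'tim']
```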
Upvotes: 8
Reputation: 23099
IIUC,
you can stack all your columns vertically and then call value_counts
to count the occurrences of each item; we'll store that in a variable called s.
We then want all the names whose number of occurrences equals the maximum,
in this instance 3. After value_counts, the names are the index of s.
s = df.stack().value_counts()
# or if you want to ignore duplicates column wise
#df.stack().groupby(level=1).unique().explode().value_counts()
print(s)
tim 3
bob 3
frank 1
fred 1
joe 1
s1 = s[s.eq(s.max())].index.tolist()
print(s1)
['tim', 'bob']
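One caveat: comparing against s.max() assumes at least one name occurs in every column; if none does, s.max() would be smaller than the column count and the result would include names that are not in all columns. A slightly more defensive sketch compares against df.shape[1] instead:

```python
import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Flatten all columns into one Series and count each name
s = df.stack().value_counts()

# A name in every column occurs exactly df.shape[1] times
# (assumes no duplicates within a column; see the groupby
# variant above for handling those)
common = s[s.eq(df.shape[1])].index.tolist()
print(sorted(common))  # ['bob', 'tim']
```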
Upvotes: 1
Reputation: 7625
You can achieve this with pandas.DataFrame.apply()
and set.intersection()
, like this:
cols_set = list(df.apply(lambda col: set(col.values)).values)
output = list(set.intersection(*cols_set))
The result is the following:
>>> print(output)
['tim', 'bob']
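Along the same lines, functools.reduce gives the same intersection without building the intermediate list first (a sketch, not the only way):

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'TX': ['bob', 'tim', 'frank'],
                   'IL': ['fred', 'bob', 'tim'],
                   'NE': ['tim', 'joe', 'bob']})

# Intersect the column sets pairwise, left to right
output = list(reduce(set.intersection, (set(df[c]) for c in df.columns)))
print(sorted(output))  # ['bob', 'tim']
```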
Upvotes: 1