Reputation: 23
Given a pandas.DataFrame
:
| Col_A | Col_B | Other_Columns
0 | A0 | B0 | …
1 | A1 | B1 | …
2 | A2 | B2 | …
3 | A3 | B3 | …
… | … | … | …
I have been trying to keep a subset of this DataFrame
: getting rid of rows where both A
and B
entries are unique (e.g. if on row 6 say, both values A6
and B6
do not appear anywhere else on the DataFrame
, we want to remove this row)
I would not like to drop duplicates. Also, I would not like to get unique values (which, if I understand well would be similar to converting a list to a set, am I right?) but instead the values that appear once only.
At this point, this is what I have got:
counts = df[['Col_A','Col_B']].stack().value_counts(ascending=True)
myList = [0] + [item for item in counts[counts.values == 1].index]
toRemove = []
for i in df.index:
if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList:
toRemove.append(i)
final_df = df[~df.index.isin(toRemove)]
This is not very efficient (the data frame is pretty large >10M lines) There must be a more pythonic strategy, involving builtin features of pandas, right? Also, I am not too sure the first line is correct: by stacking both columns, am I making sure to perform the count over the entries of both columns?
Please do not hesitate if you need more information or if my writing is not clear to you.
Many thanks for taking the time :-)
Upvotes: 2
Views: 216
Reputation: 210842
what about something like this:
In [75]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('abc'))
In [76]: df
Out[76]:
a b c
0 37 85 17
1 19 0 11
2 51 20 65
3 59 92 65
4 48 15 91
5 21 50 44
6 61 94 49
7 51 6 88
8 89 72 40
9 5 51 79
In [77]: c = df[['a','b']].stack().value_counts()
In [78]: c
Out[78]:
51 3
94 1
15 1
37 1
6 1
72 1
50 1
21 1
5 1
48 1
61 1
19 1
20 1
85 1
89 1
59 1
92 1
0 1
dtype: int64
In [79]: c[c>1]
Out[79]:
51 3
dtype: int64
In [80]: vals = c[c>1].index
In [81]: df[(df['a'].isin(vals)) | (df['b'].isin(vals))]
Out[81]:
a b c
2 51 20 65
7 51 6 88
9 5 51 79
UPDATE:
when you do the if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList:
check you are checking it not quite correctly...
here is what is happening:
In [90]: df.at[0, 'a'], df.at[0, 'b']
Out[90]: (37, 85)
In [91]: (df.at[0, 'a'] and df.at[0, 'b'])
Out[91]: 85
So you can't check it this way
Upvotes: 1