UniversalTraveller
UniversalTraveller

Reputation: 23

Remove Rows from pandas.Dataframe Based on Entries

Given a pandas.DataFrame:

    | Col_A | Col_B | Other_Columns
0   | A0    | B0    | …
1   | A1    | B1    | …
2   | A2    | B2    | …
3   | A3    | B3    | …
…   | …     | …     | …

I have been trying to keep a subset of this DataFrame: getting rid of rows where both A and B entries are unique (e.g. if on row 6 say, both values A6 and B6 do not appear anywhere else on the DataFrame, we want to remove this row)

I would not like to drop duplicates. Also, I would not like to get unique values (which, if I understand well would be similar to converting a list to a set, am I right?) but instead the values that appear once only.

At this point, this is what I have got:

counts = df[['Col_A','Col_B']].stack().value_counts(ascending=True)
myList = [0] + [item for item in counts[counts.values == 1].index]
toRemove = []
for i in df.index:
    if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList:
        toRemove.append(i)
final_df = df[~df.index.isin(toRemove)]

This is not very efficient (the data frame is pretty large >10M lines) There must be a more pythonic strategy, involving builtin features of pandas, right? Also, I am not too sure the first line is correct: by stacking both columns, am I making sure to perform the count over the entries of both columns?

Please do not hesitate if you need more information or if my writing is not clear to you.

Many thanks for taking the time :-)

Upvotes: 2

Views: 216

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

what about something like this:

In [75]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('abc'))

In [76]: df
Out[76]:
    a   b   c
0  37  85  17
1  19   0  11
2  51  20  65
3  59  92  65
4  48  15  91
5  21  50  44
6  61  94  49
7  51   6  88
8  89  72  40
9   5  51  79

In [77]: c = df[['a','b']].stack().value_counts()

In [78]: c
Out[78]:
51    3
94    1
15    1
37    1
6     1
72    1
50    1
21    1
5     1
48    1
61    1
19    1
20    1
85    1
89    1
59    1
92    1
0     1
dtype: int64

In [79]: c[c>1]
Out[79]:
51    3
dtype: int64

In [80]: vals = c[c>1].index

In [81]: df[(df['a'].isin(vals)) | (df['b'].isin(vals))]
Out[81]:
    a   b   c
2  51  20  65
7  51   6  88
9   5  51  79

UPDATE:

when you do the if (df.at[i,'Col_A'] and df.at[i, 'Col_B']) in myList: check you are checking it not quite correctly...

here is what is happening:

In [90]: df.at[0, 'a'], df.at[0, 'b']
Out[90]: (37, 85)

In [91]: (df.at[0, 'a'] and df.at[0, 'b'])
Out[91]: 85

So you can't check it this way

Upvotes: 1

Related Questions