A. Sarid

Reputation: 3996

Pandas: delete or change specific rows according to multiple columns values

I have a dictionary that maps tuples to lists of values, for example:

{
    ('book1', 'US'): [1, 5],
    ('book2', 'CA'): [3]
}

Each tuple holds the values of two columns in a pandas DataFrame (there might be other columns as well), and each list holds values of a third column in the same DataFrame.
Let's say my dataframe looks something like this:

    book country  value
0  book1      US      1
1  book1      US      9
2  book1      US      5
3  book2      MX      7
4  book2      CA      3
5  book1      CA      1

I would like to either delete the rows that the dict above represents or change the value of those rows to 0.
So the result will be either:

    book country  value
0  book1      US      0
1  book1      US      9
2  book1      US      0
3  book2      MX      7
4  book2      CA      0
5  book1      CA      1

Or:

    book country  value
0  book1      US      9
1  book2      MX      7
2  book1      CA      1

What is the best approach for doing this?
I want to do this on a pretty large dataframe and it should be as efficient as possible.

My idea was to do something like this, but it doesn't seem very efficient (because of the repeated loc calls), and I get duplicate rows instead of having the rows removed. (I don't want to use drop_duplicates, as there might be duplicates in the first place that I don't want to drop.)

data_frame.set_index(['book', 'country'], inplace=True)
for key, values in rows_to_remove.items():
    data_frame.loc[key] = data_frame.loc[key][~data_frame.loc[key]['value'].isin(values)]
data_frame.reset_index(inplace=True)

Upvotes: 1

Views: 163

Answers (1)

jezrael

Reputation: 862591

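You can flatten the dictionary into a list of (book, country, value) tuples, check membership with Index.isin on a MultiIndex built from those three columns, and filter with boolean indexing: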

d = {
    ('book1', 'US'): [1, 5],
    ('book2', 'CA'): [3]
}

tups = [k + (x, ) for k, v in d.items() for x in v]

df = df[~df.set_index(['book','country','value']).index.isin(tups)]
print (df)
    book country  value
1  book1      US      9
3  book2      MX      7
5  book1      CA      1

To set the value to 0 for the matching rows instead, use loc with the same mask:

df.loc[df.set_index(['book','country','value']).index.isin(tups), 'value'] = 0
print (df)
    book country  value
0  book1      US      0
1  book1      US      9
2  book1      US      0
3  book2      MX      7
4  book2      CA      0
5  book1      CA      1
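
For reference, here is a minimal self-contained sketch of both variants on the sample data from the question. It builds the mask once with pd.MultiIndex.from_frame rather than set_index(...).index; that substitution, and the names mask, dropped and zeroed, are illustrative choices and not part of the code above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "book": ["book1", "book1", "book1", "book2", "book2", "book1"],
    "country": ["US", "US", "US", "MX", "CA", "CA"],
    "value": [1, 9, 5, 7, 3, 1],
})
d = {("book1", "US"): [1, 5], ("book2", "CA"): [3]}

# Flatten the dict into (book, country, value) tuples.
tups = [key + (x,) for key, values in d.items() for x in values]

# Boolean mask: True where a row's (book, country, value) is listed in d.
cols = ["book", "country", "value"]
mask = pd.MultiIndex.from_frame(df[cols]).isin(tups)

# Variant 1: drop the matching rows (the original index labels are kept).
dropped = df[~mask]

# Variant 2: keep all rows but set the matching values to 0.
zeroed = df.copy()
zeroed.loc[mask, "value"] = 0

print(dropped)
print(zeroed)
```

Computing the mask once and reusing it avoids the per-key loc calls from the question, and any other columns in the DataFrame are left untouched.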

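Another solution is to build a DataFrame from the tuples, append it to df twice, and drop every row that has a duplicate. Appending it twice guarantees that the listed tuples are dropped even when they do not occur in df. Note that this also drops rows that were already duplicated in df: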

tups = [k + (x, ) for k, v in d.items() for x in v]

df1 = pd.DataFrame(tups, columns=['book','country','value'])

df2 = pd.concat([df, df1, df1], ignore_index=True)
df = df2[~df2.duplicated(keep=False)]
print (df)
    book country  value
1  book1      US      9
3  book2      MX      7
5  book1      CA      1
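
If df may already contain duplicate rows that should be kept, one possible alternative is a left merge with indicator=True. This is a different technique from the concat/duplicated approach, sketched here on hypothetical data that contains a repeated row; the names to_remove, merged and result are illustrative.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 are identical and should both survive.
df = pd.DataFrame({
    "book": ["book1", "book1", "book1", "book2", "book2"],
    "country": ["US", "US", "US", "CA", "MX"],
    "value": [1, 9, 9, 3, 7],
})
d = {("book1", "US"): [1, 5], ("book2", "CA"): [3]}

cols = ["book", "country", "value"]
tups = [key + (x,) for key, values in d.items() for x in values]

# One row per combination to remove; drop_duplicates keeps the merge
# from multiplying rows of df if d lists the same value twice.
to_remove = pd.DataFrame(tups, columns=cols).drop_duplicates()

# Left merge: "_merge" is "both" for rows listed in d, "left_only" otherwise.
# reset_index() carries the original row labels along in an "index" column.
merged = df.reset_index().merge(to_remove, on=cols, how="left", indicator=True)
keep_labels = merged.loc[merged["_merge"] == "left_only", "index"]

result = df.loc[keep_labels]
print(result)
```

On this data, duplicated(keep=False) would flag both (book1, US, 9) rows and remove them, while the merge keeps them because neither is listed in d.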

Upvotes: 2
