Hound
Hound

Reputation: 972

Deleting Specific Rows from Pandas Dataframe

I have a very large dataframe that I am trying to clean up. Here is a sample of the dataframe.

[in]: dftrain

[out]:
    date    store_nbr   item_nbr    units       tavg    preciptotal
2012-01-01       1          1          0         42         0.05
2012-01-01       1          2          2         42         0.05
2012-01-01       1          3          0         42         0.05 
2012-01-01       1          4          0         42         0.05 

2012-01-01       2          1          0         62         0.15
2012-01-01       2          2          0         62         0.15
2012-01-01       2          3          20        62         0.15
2012-01-01       2          4          1         62         0.15

2012-01-01       3          1          0         72         1.05
2012-01-01       3          2          0         72         1.05
2012-01-01       3          3          29        72         1.05
2012-01-01       3          3          0         72         1.05
...(New Day) 
  date      store_nbr   item_nbr     units      tavg    preciptotal
2012-01-02       1          1          0         47         0.10
2012-01-02       1          2          0         47         0.10
2012-01-02       1          3          0         47         0.10 
2012-01-02       1          4          0         47         0.10 

2012-01-02       2          1          0         60         0.0
2012-01-02       2          2          0         60         0.0
2012-01-02       2          3          5         60         0.0
2012-01-02       2          4          0         60         0.0

2012-01-02       3          1          0         67         .23
2012-01-02       3          2          0         67         .23
2012-01-02       3          3          2         67         .23
2012-01-02       3          3          0         67         .23
...

I have a dictionary with the str_nbr as the key and item_nbr as the values of the store and item numbers I want to keep. Here is a sample dictionary: {'1': [2], '2': [3,4], '3': [3]} What I want to do is go store by store and delete all the rows with the item_nbr NOT contained as the values of my dictionary. So in this example when I have dropped the undesired rows I would have this...

 [in]: dfrain_drop

[out]:
    date    store_nbr   item_nbr    units       tavg    preciptotal

2012-01-01       1          2          2         42         0.05

2012-01-01       2          3          20        62         0.15
2012-01-01       2          4          1         62         0.15

2012-01-01       3          3          29        72         1.05

...(New Day) 
  date      store_nbr   item_nbr     units      tavg    preciptotal

2012-01-02       1          2          0         47         0.10

2012-01-02       2          3          5         60         0.0
2012-01-02       2          4          0         60         0.0

2012-01-02       3          3          0         67         .23

Upvotes: 0

Views: 324

Answers (2)

Evan Wright
Evan Wright

Reputation: 690

This is the fastest solution I could come up with:

is_good = lambda group: group.isin(keep[group.name])
result = df[df.groupby('store_nbr')['item_nbr'].apply(is_good)]

It splits up the DataFrame by store_nbr, calls is_good on each each group (apply) to determine the rows you want to keep, puts everything back together in the right order, and then takes a subset of rows from the original frame.

Upvotes: 1

cphlewis
cphlewis

Reputation: 16249

Using the first day of your sample dataset:

keep = {'1': [2], '2': [3,4], '3': [3]}

parts = []
for i in keep: parts.append(dftrain.query(
               "item_nbr in %s and store_nbr == %d"%(keep[i],int(i))
               ))
pd.concat(parts)
          date  store_nbr  item_nbr  units  tavg  preciptotal
1   2012-01-01          1         2      2    42         0.05
10  2012-01-01          3         3     29    72         1.05
11  2012-01-01          3         3      0    72         1.05
6   2012-01-01          2         3     20    62         0.15
7   2012-01-01          2         4      1    62         0.15

result, all one tidied-up dataset. ( I believe your sample output is missing an item from store 3.)

Upvotes: 0

Related Questions