Reputation: 972
I have a very large dataframe that I am trying to clean up. Here is a sample of the dataframe.
[in]: dftrain
[out]:
date store_nbr item_nbr units tavg preciptotal
2012-01-01 1 1 0 42 0.05
2012-01-01 1 2 2 42 0.05
2012-01-01 1 3 0 42 0.05
2012-01-01 1 4 0 42 0.05
2012-01-01 2 1 0 62 0.15
2012-01-01 2 2 0 62 0.15
2012-01-01 2 3 20 62 0.15
2012-01-01 2 4 1 62 0.15
2012-01-01 3 1 0 72 1.05
2012-01-01 3 2 0 72 1.05
2012-01-01 3 3 29 72 1.05
2012-01-01 3 3 0 72 1.05
...(New Day)
date store_nbr item_nbr units tavg preciptotal
2012-01-02 1 1 0 47 0.10
2012-01-02 1 2 0 47 0.10
2012-01-02 1 3 0 47 0.10
2012-01-02 1 4 0 47 0.10
2012-01-02 2 1 0 60 0.0
2012-01-02 2 2 0 60 0.0
2012-01-02 2 3 5 60 0.0
2012-01-02 2 4 0 60 0.0
2012-01-02 3 1 0 67 .23
2012-01-02 3 2 0 67 .23
2012-01-02 3 3 2 67 .23
2012-01-02 3 3 0 67 .23
...
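For reference, here is one way to rebuild the first day for stores 1 and 2 (I'm assuming date is a plain string and the other columns are ordinary ints/floats):
import pandas as pd

# hand-built copy of part of the first day (stores 1 and 2 only)
dftrain = pd.DataFrame({
    'date':        ['2012-01-01'] * 8,
    'store_nbr':   [1, 1, 1, 1, 2, 2, 2, 2],
    'item_nbr':    [1, 2, 3, 4, 1, 2, 3, 4],
    'units':       [0, 2, 0, 0, 0, 0, 20, 1],
    'tavg':        [42, 42, 42, 42, 62, 62, 62, 62],
    'preciptotal': [0.05, 0.05, 0.05, 0.05, 0.15, 0.15, 0.15, 0.15],
})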
I have a dictionary whose keys are the store_nbr values and whose values are lists of the item_nbr values I want to keep for that store. Here is a sample dictionary: {'1': [2], '2': [3,4], '3': [3]}
What I want to do is go store by store and delete all the rows whose item_nbr is NOT in that store's list. So in this example, after dropping the undesired rows, I would have this...
[in]: dftrain_drop
[out]:
date store_nbr item_nbr units tavg preciptotal
2012-01-01 1 2 2 42 0.05
2012-01-01 2 3 20 62 0.15
2012-01-01 2 4 1 62 0.15
2012-01-01 3 3 29 72 1.05
...(New Day)
date store_nbr item_nbr units tavg preciptotal
2012-01-02 1 2 0 47 0.10
2012-01-02 2 3 5 60 0.0
2012-01-02 2 4 0 60 0.0
2012-01-02 3 3 0 67 .23
Upvotes: 0
Views: 324
Reputation: 690
This is the fastest solution I could come up with:
is_good = lambda group: group.isin(keep[group.name])
result = df[df.groupby('store_nbr')['item_nbr'].apply(is_good)]
It splits up the DataFrame by store_nbr, calls is_good on each group (apply) to determine the rows you want to keep, puts everything back together in the right order, and then takes that subset of rows from the original frame.
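A self-contained sketch of the same idea, using part of the question's first day. I'm assuming the keep dict is keyed by ints so it matches the store_nbr column (the question shows string keys), and I've added group_keys=False so the mask stays aligned with the original index on newer pandas versions:
import pandas as pd

# part of the question's first day (stores 1 and 2)
df = pd.DataFrame({
    'date':      ['2012-01-01'] * 8,
    'store_nbr': [1, 1, 1, 1, 2, 2, 2, 2],
    'item_nbr':  [1, 2, 3, 4, 1, 2, 3, 4],
    'units':     [0, 2, 0, 0, 0, 0, 20, 1],
})

# keys must have the same dtype as store_nbr, so ints rather than '1', '2', ...
keep = {1: [2], 2: [3, 4]}

# group.name is the store_nbr of the group being processed
is_good = lambda group: group.isin(keep[group.name])

# group_keys=False keeps the boolean mask indexed like df, so it can be used
# directly to select rows from the original frame
mask = df.groupby('store_nbr', group_keys=False)['item_nbr'].apply(is_good)
result = df[mask]
print(result)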
Upvotes: 1
Reputation: 16249
Using the first day of your sample dataset:
keep = {'1': [2], '2': [3, 4], '3': [3]}
# build one piece per store, keeping only that store's wanted item_nbr values
parts = []
for i in keep:
    parts.append(dftrain.query(
        "item_nbr in %s and store_nbr == %d" % (keep[i], int(i))
    ))
pd.concat(parts)
          date  store_nbr  item_nbr  units  tavg  preciptotal
1   2012-01-01          1         2      2    42         0.05
10  2012-01-01          3         3     29    72         1.05
11  2012-01-01          3         3      0    72         1.05
6   2012-01-01          2         3     20    62         0.15
7   2012-01-01          2         4      1    62         0.15
This is the result, all in one tidied-up dataset. (I believe your sample output is missing an item from store 3.)
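If the original row order matters, one option is to sort the concatenated frame back by its index afterward:
# restore the original row order of dftrain
result = pd.concat(parts).sort_index()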
Upvotes: 0