Reputation: 881
I have created a dataframe:
[in] testing_df = pd.DataFrame(test_array, columns=['transaction_id', 'product_id'])
# Split the product_id's for the testing data
testing_df.set_index(['transaction_id'],inplace=True)
testing_df['product_id'] = testing_df['product_id'].apply(lambda row: row.split(','))
[out]
                product_id
transaction_id
001             [P01]
002             [P01, P02]
003             [P01, P02, P09]
004             [P01, P03]
005             [P01, P03, P05]
006             [P01, P03, P07]
007             [P01, P03, P08]
008             [P01, P04]
009             [P01, P04, P05]
010             [P01, P04, P08]
How can I now remove 'P04' and 'P08' from the results?
I tried:
# Remove P04 and P08 from consideration
testing_df['product_id'] = testing_df['product_id'].map(lambda x: x.strip('P04'))
testing_df['product_id'].replace(regex=True,inplace=True,to_replace=r'P04,',value=r'')
However, neither option seems to work.
The datatypes are:
[in] print(testing_df.dtypes)
[out] product_id object
dtype: object
[in] print(testing_df['product_id'].dtypes)
[out] object
Upvotes: 2
Views: 6517
Reputation: 164613
A list comprehension will likely be most efficient:
exc = {'P04', 'P08'}
df['product_id'] = [[i for i in L if i not in exc] for L in df['product_id']]
Note that an inefficient Python-level loop is unavoidable: apply + lambda, map + lambda, or an in-place solution all involve a Python-level loop.
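As a self-contained illustration of the list comprehension, here is a minimal sketch. The sample rows are an assumption (a small subset modelled on the question's data), and `exc` is the same set as above.

```python
import pandas as pd

# Sample data modelled on the question (only a few rows, for illustration).
df = pd.DataFrame(
    {"product_id": ["P01", "P01,P03,P08", "P01,P04", "P01,P04,P05"]},
    index=pd.Index(["001", "007", "008", "009"], name="transaction_id"),
)
df["product_id"] = df["product_id"].str.split(",")

# Products to drop; a set keeps the membership check cheap.
exc = {"P04", "P08"}

# Rebuild each row's list without the excluded products; order is preserved.
df["product_id"] = [[i for i in L if i not in exc] for L in df["product_id"]]

print(df)
```

Because each list is rebuilt rather than modified, repeated ids are all removed and the remaining items keep their original order.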
Upvotes: 0
Reputation: 210812
I would do it before splitting:
Data:
In [269]: df
Out[269]:
                product_id
transaction_id
1               P01
2               P01,P02
3               P01,P02,P09
4               P01,P03
5               P01,P03,P05
6               P01,P03,P07
7               P01,P03,P08
8               P01,P04
9               P01,P04,P05
10              P01,P04,P08
Solution:
In [271]: df['product_id'] = df['product_id'].str.replace(r'\,*?(?:P04|P08)\,*?', '', regex=True) \
                                             .str.split(',')
In [272]: df
Out[272]:
                product_id
transaction_id
1               [P01]
2               [P01, P02]
3               [P01, P02, P09]
4               [P01, P03]
5               [P01, P03, P05]
6               [P01, P03, P07]
7               [P01, P03]
8               [P01]
9               [P01, P05]
10              [P01]
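The pattern above relies on the unwanted id never being the first item in the string. If that cannot be guaranteed, one possible variant (a sketch, not the original answer's pattern) anchors the match to whole comma-separated tokens. The sample series and the names `pattern`, `cleaned` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical input, including rows where the unwanted id comes first.
s = pd.Series(["P01,P03,P08", "P04,P01", "P01,P04,P05", "P04"])

# Match an excluded id together with its leading comma (or the string start),
# but only when it is a whole token: followed by a comma or the end.
pattern = r"(?:^|,)(?:P04|P08)(?=,|$)"

# regex=True is needed on pandas 2.0+, where str.replace matches literally by default.
cleaned = s.str.replace(pattern, "", regex=True).str.lstrip(",")

# An empty string means every id was removed; map it to an empty list.
result = [x.split(",") if x else [] for x in cleaned]

print(result)
```

The lookahead keeps an id such as `P040` from being partially matched, and the final step avoids the `['']` that splitting an empty string would produce.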
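Alternatively, you can replace:
testing_df['product_id'] = testing_df['product_id'].apply(lambda row: row.split(','))
with the following, which removes the unwanted products as a set difference (note that a set does not preserve the original order, as the demo shows):
testing_df['product_id'] = testing_df['product_id'].apply(lambda row: list(set(row.split(',')) - set(['P04', 'P08'])))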
Demo:
In [280]: df.product_id.apply(lambda row: list(set(row.split(','))- set(['P04','P08'])))
Out[280]:
transaction_id
1     [P01]
2     [P01, P02]
3     [P09, P01, P02]
4     [P01, P03]
5     [P01, P03, P05]
6     [P07, P01, P03]
7     [P01, P03]
8     [P01]
9     [P01, P05]
10    [P01]
Name: product_id, dtype: object
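Since the set difference returns items in arbitrary order (see rows 3 and 6 in the demo), one option, if a reproducible order matters, is to sort the result. This is a minimal sketch on made-up sample rows; it assumes lexicographic order is acceptable and that duplicate ids within a row can be collapsed.

```python
import pandas as pd

# A few unsplit rows modelled on the data above.
s = pd.Series(["P01,P02,P09", "P01,P03,P07", "P01,P04,P08"])
exc = {"P04", "P08"}

# Set difference drops the unwanted ids; sorted() makes the order deterministic.
result = s.apply(lambda row: sorted(set(row.split(",")) - exc))

print(result.tolist())
```

If the original order or duplicates must be kept, filtering each list item by item is the safer choice.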
Upvotes: 2
Reputation: 1191
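Store all the elements to be removed in a list, then remove them from each row's list in place:
remove_results = ['P04', 'P08']
# Use .iloc for positional access, since the index holds transaction ids.
for k in range(len(testing_df['product_id'])):
    for r in remove_results:
        if r in testing_df['product_id'].iloc[k]:
            testing_df['product_id'].iloc[k].remove(r)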
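One caveat: `list.remove` deletes only the first matching item, so a row that lists the same product twice would keep one copy. A minimal sketch of the same in-place idea that handles repeats is below. The sample frame is made up, and row `011` is a hypothetical case added to show a duplicate.

```python
import pandas as pd

# Made-up sample; row "011" is hypothetical and contains P04 twice.
testing_df = pd.DataFrame(
    {"product_id": [["P01", "P04"], ["P01", "P04", "P08"], ["P04", "P01", "P04"]]},
    index=pd.Index(["008", "010", "011"], name="transaction_id"),
)
remove_results = ["P04", "P08"]

# Iterate over the list objects stored in the column and mutate them in place.
for products in testing_df["product_id"]:
    for r in remove_results:
        # Keep removing until no copy of r is left in this row's list.
        while r in products:
            products.remove(r)

print(testing_df)
```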
Upvotes: 1