Reputation: 350
I have two columns within a data frame containing strings. For example,
import pandas as pd
import numpy as np
data = [['Oct-2019', 'Oranges + Grapes + Pears', 'Grapes + Pears'],
['Nov-2019', 'Oranges + Grapes + Pears', 'Oranges + Grapes + Pears']]
df = pd.DataFrame(data, columns =['Date', 'Previous shopping list', 'Recent shopping list'])
print(df)
Fish = ['Salmon', 'Trout']
Fruit = ['Oranges', 'Grapes', 'Pears']
Date PSL RSL
0 Oct-2019 Oranges + Grapes Grapes + Pears
+ Pears + Salmon
1 Nov-2019 Oranges + Grapes Oranges + Grapes
+ Pears + Trout + Pears
I want to compare the strings in both columns and have a text output to a new column that says what has changed between the two lists. Such as, creating a column that will check for the strings related to "Fruit" and output what fruit has been dropped from the recent shopping when compared to the previous list previous shopping list. See Desired output below:
Date PSL RSL Fruit lost Fish Lost
0 Oct-2019 Oranges + Grapes Grapes + Pears Oranges Salmon
+ Pears + Salmon
1 Nov-2019 Oranges + Grapes Oranges + Grapes Trout
+ Pears + Trout + Pears
How would I be able to achieve this in using pandas! Apologies if this was not clear the first time!
Thank you for any suggestion/help!
Upvotes: 0
Views: 119
Reputation: 350
So Mark's solution works well to grab the difference between the lists
# process data so each row contains a list of elements
df['PSL_processed'] = df['Previous shopping list'].str.split()
df['RSL_processed'] = df['Recent shopping list'].str.split()
def compare_items(x):
return set(x.PSL_processed) - set(x.RSL_processed)
# add in conditional logic here, to meet specification
df['Products_lost'] = df.apply(compare_items, axis=1)
print(df)
On top to that to find the products that = fruit and the products = fish I used the following:
for idx, row in df.iterrows():
for c in Fruit:
if c in row['Products_lost']:
df.ix[idx, 'Fruit lost'] = c
for c in Fish:
if c in row['Products_lost']:
df.ix[idx, 'Fish lost'] = c
Seems to work well!
Upvotes: 0
Reputation: 376
To check if the string "Oranges" exists in "Recent shopping list" and create a new column "Oranges Lost" based on the outcome:
df['Oranges Lost'] = np.where(df['Recent shopping list'].str.contains('Oranges'), 'No Change', 'Lost')```
Upvotes: 1
Reputation: 2175
The exact function that you use to process the data depends on your exact output that you require for each combination. Hopefully below will give you enough to create a solution for your problem:
# process data so each row contains a list of elements
df['PSL_processed'] = df['Previous shopping list'].str.split('+')
df['RSL_processed'] = df['Recent shopping list'].str.split('+')
def compare_items(x):
if set(x.PSL_processed) == set(x.RSL_processed):
return 'No change'
elif set(x.PSL_processed) - set(x.CSL_processed) > 0:
return 'Lost'
# add in conditional logic here, to meet specification
df.apply(compare_items, axis=1)
The official documentation for pd.apply()
is well written.
Upvotes: 1