Reputation: 255
I have a dataframe that looks like this.
code description col3 col4
123456 nice shoes size4 something something
123456 nice shoes size5 something something
567890 boots size 1 something something
567890 boots size 2 something something
567890 boots size 3 something something
234567 baby overall 2yrs something. something
234567 baby overall 3-4yrs something something
456778 shirt m Something. Something
456778 shirt l something Something
456778 shirt xl Something Something
I like to shorten 'description' to be the common substring based on similar 'code' column. and drop duplicates.
code description col3 col4
123456 nice shoes something something
567890 boots something something
234567 baby overall something something
456778 shirt Something Something
I Suspect need to groupby and maybe apply a function but not able to get my head round this. Found a function but that takes in 2 strings. Not sure if it could be of help. And this function only takes 2 strings whereas my data may have 5 rows having same code...
from difflib import SequenceMatcher
string1 = "apple pie available"
string2 = "come have some apple pies"
def extract_common(string1, string2):
match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
print(match) # -> Match(a=0, b=15, size=9)
print(string1[match.a: match.a + match.size]) # -> apple pie
print(string2[match.b: match.b + match.size]) # -> apple pie
return string1[match.a: match.a + match.size]
Appreciate any help rendered.
Upvotes: 1
Views: 669
Reputation: 30930
You need pandas 0.25.1 to use explode
mask=(df.groupby('code')['code'].transform('size')>1)
df1=df[mask]
df2=df[~mask]
s=df1.groupby('code',sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
s_not_duplicates=s.to_frame()[s.map(s.value_counts()>1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
description_not_duplicates=pd.concat([s_not_duplicates,df2.description])
print(description_not_duplicates)
123456 nice shoes
234567 baby overall
456778 shirt
567890 boots size
Name: description, dtype: object
Upvotes: 1