J T
J T

Reputation: 255

pandas extract common substring from 1 column based on conditions in another column

I have a dataframe that looks like this.

code    description          col3        col4
123456  nice shoes size4     something   something
123456  nice shoes size5     something   something
567890  boots size 1         something   something
567890  boots size 2         something   something
567890  boots size 3         something   something
234567 baby overall 2yrs     something.  something
234567 baby overall 3-4yrs     something  something
456778 shirt m     Something.   Something
456778 shirt l     something    Something
456778 shirt xl    Something   Something

I like to shorten 'description' to be the common substring based on similar 'code' column. and drop duplicates.

code    description          col3        col4
123456  nice shoes          something   something
567890  boots               something   something
234567 baby overall    something    something
456778 shirt              Something   Something

I Suspect need to groupby and maybe apply a function but not able to get my head round this. Found a function but that takes in 2 strings. Not sure if it could be of help. And this function only takes 2 strings whereas my data may have 5 rows having same code...

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

def extract_common(string1, string2):
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

    print(match)  # -> Match(a=0, b=15, size=9)
    print(string1[match.a: match.a + match.size])  # -> apple pie
    print(string2[match.b: match.b + match.size])  # -> apple pie
    return string1[match.a: match.a + match.size]

Appreciate any help rendered.

Upvotes: 1

Views: 669

Answers (1)

ansev
ansev

Reputation: 30930

You need pandas 0.25.1 to use explode

mask=(df.groupby('code')['code'].transform('size')>1)
df1=df[mask]
df2=df[~mask]
s=df1.groupby('code',sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
s_not_duplicates=s.to_frame()[s.map(s.value_counts()>1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
description_not_duplicates=pd.concat([s_not_duplicates,df2.description])
print(description_not_duplicates)

123456      nice shoes
234567    baby overall
456778           shirt
567890      boots size
Name: description, dtype: object

Upvotes: 1

Related Questions