learner
learner

Reputation: 877

In Pandas dataframe remove duplicate strings in a list column and remove the corresponding ids in the other list column

I have Pandas dataframe like this:

df =
        A                              B

       [1, 5, 8, 10]                 [str1, str_to_be_removed*, str_to_be_removed*, str2]
       [4, 7, 10]                    [str2, str5, str10]
       [5, 2, 7, 9, 15]              [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
         ...                     ...

Given, str_to_be_removed, I would like to keep only the 1st instance of the string which contains str_to_be_removed, in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. Size of lists contained in A and B in each row is the same.

How to do this?

Desired Output:

       df = 

       A                           B
      
      [1, 5, 10]                  [str1, str_to_be_removed*, str2]
      [4, 7, 10]                  [str2, str5, str10]
      [5, 2, 7]                   [str6, str2, str_to_be_removed*]

EDIT:

So, this is the sample df:

   df =
       df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5]],[10,11,12]\
               'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],['str_to_be_removed_bla','b'],['b','c','d'],\
                    ['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']\
                   ]})


  Desired Output:

    df =
    
    A                      B

    [1]                    [str_to_be_removed_bla]
    [2,3]                  [str_to_be_removed_bla, b]
    [3,4,5]                [b, c, d]
    [10, 12]               [str_to_be_removed_bla_bla,f]

Upvotes: 1

Views: 1058

Answers (4)

Ferris
Ferris

Reputation: 5601

steps:

  1. use zip to combine relate elements together.
  2. unnest the zip list, use explode
  3. drop duplicates, keep first
  4. goupby index, agg as list.
df = pd.DataFrame({'A':[
            [1,2],[2,3],[3,4,5],[10, 11, 12] 
        ],
       'B':[['str_to_be_removed_bla','str_to_be_removed_bla'], 
            ['str_to_be_removed_bla','b'],
            ['b', 'c', 'd'],
            ['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
           ]})

# zip and explode step 
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
#         axis = 1
#        ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]

update with @Joe Ferndz's solution, simplify the steps.

# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()

# can add more conditions to filter and handle to drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})

result

               A                               B
index                                           
0            [1]         [str_to_be_removed_bla]
1         [2, 3]      [str_to_be_removed_bla, b]
2      [3, 4, 5]                       [b, c, d]
3       [10, 12]  [str_to_be_removed_bla_bla, f]

extend this solution to multiple patterns in 1 step: str_to_be_removed1, str_to_be_removed2,

df = pd.DataFrame({'A':[
            [1,2],[2,3],[3,4,5, 6],[10, 11, 12] 
        ],
       'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'], 
            ['str_to_be_removed1_bla','b'],
            ['b', 'b', 'c', 'd'],
            ['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
           ]})
print(df)

#               A                                                        B
# 0        [1, 2]         [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1        [2, 3]                              [str_to_be_removed1_bla, b]
# 2  [3, 4, 5, 6]                                             [b, b, c, d]
# 3  [10, 11, 12]  [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)

#     index   A                           B                 tag
# 0       0   1      str_to_be_removed1_bla  str_to_be_removed1
# 1       0   2      str_to_be_removed1_bla  str_to_be_removed1
# 2       1   2      str_to_be_removed1_bla  str_to_be_removed1
# 3       1   3                           b                 NaN
# 4       2   3                           b                 NaN
# 5       2   4                           b                 NaN
# 6       2   5                           c                 NaN
# 7       2   6                           d                 NaN
# 8       3  10  str_to_be_removed2_bla_bla  str_to_be_removed2
# 9       3  11      str_to_be_removed2_bla  str_to_be_removed2
# 10      3  12                           f                 NaN

cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
#                   A                                B
# index                                               
# 0               [1]         [str_to_be_removed1_bla]
# 1            [2, 3]      [str_to_be_removed1_bla, b]
# 2      [3, 4, 5, 6]                     [b, b, c, d]
# 3          [10, 12]  [str_to_be_removed2_bla_bla, f]

Upvotes: 2

Joe Ferndz
Joe Ferndz

Reputation: 8508

Here's an another approach to do this. Here' you can give the search string and it will remove it from the list.

import pandas as pd
pd.set_option('display.max_colwidth', 500)

df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
                   'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
                        ['str_to_be_removed_bla','b'],
                        ['b','c','d'],
                        ['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)

search_str = 'str_to_be_removed'

#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())

#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)

#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()

#Exclude all records where search string was found and count is more than 1
dfTemp = dfTemp[~(dfTemp['Found'] & dfTemp['Bx'] > 0)]

#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})

del dfTemp

print (df_final)

The output of this is:

Original dataframe:

              A                                                      B
0        [1, 2]         [str_to_be_removed_bla, str_to_be_removed_bla]
1        [2, 3]                             [str_to_be_removed_bla, b]
2     [3, 4, 5]                                              [b, c, d]
3  [10, 11, 12]  [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]

search string: 'str_to_be_removed'

Final dataframe:

           A                               B
0        [1]         [str_to_be_removed_bla]
1     [2, 3]      [str_to_be_removed_bla, b]
2  [3, 4, 5]                       [b, c, d]
3   [10, 12]  [str_to_be_removed_bla_bla, f]

Upvotes: 2

akuiper
akuiper

Reputation: 214927

Don't think a native pandas solution is possible for this case. And even there is, you are not likely to have a performance gain with it. A traditional for loop might work better for your case:

def remove_dupes(a, b, dupe_pattern):
  seen_dupe = False
  _a, _b = [], []
  for x, y in zip(a, b):
    if dupe_pattern in y:
      if not seen_dupe:
        seen_dupe = True
        _a.append(x)
        _b.append(y)
    else:
      _a.append(x)
      _b.append(y)
  return _a, _b

_A, _B = [], []

for a, b in zip(df.A, df.B):
  _a, _b = remove_dupes(a, b, 'str_to_be_removed')
  _A.append(_a)
  _B.append(_b)

df['A'], df['B'] = _A, _B
print(df)

#           A                               B
#0        [1]         [str_to_be_removed_bla]
#1     [2, 3]      [str_to_be_removed_bla, b]
#2  [3, 4, 5]                       [b, c, d]
#3   [10, 12]  [str_to_be_removed_bla_bla, f]

Try run it here.

Upvotes: 1

BENY
BENY

Reputation: 323226

Let us try dict

out = pd.DataFrame([[list(dict(zip(y[::-1],x[::-1])).values())[::-1],list(dict(zip(y[::-1],x[::-1])).keys())[::-1]]  for x , y in zip(df.A,df.B)])
out
        0       1
0     [1]     [a]
1  [2, 2]  [a, b]

Sample dataframe

df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})

Upvotes: 1

Related Questions