Reputation: 1460

How to difference a column in data frame which contains lists?

I have two data frames in which one of the column contains lists. I have to difference the columns like shown below.

DF1:

A   B
111 [12,13,14,14,15,13]
222 [15,16,17,15,17,17,17]
333 [17,14,16,14,14,17,17,16]
444 [25,26,18,12,12,12,13,18]

DF2:

A   B
111 [12,14]
222 []
333 [17,16]
444 [25,18]

Expected Output:

A   B
111 [13,15,13]
222 [15,16,17,15,17,17,17]
333 [14,14,14]
444 [26,12,12,12,13]

Upvotes: 2

Answers (5)

pylearner

Reputation: 1460

This worked for me in a less time compared to all the above

df3 = df1.merge(df2, on='A')

def set_diff(movie, all_):
    if movie is not None:
        return [item for item in movie if item not in all_]
    else:
        all_

movie_list = []
for item, row in df3.iterrows():
    movie = row['df2.B']
    all_  = row['df1.B']

    movie_list.append(set_diff(movie, all_))

Upvotes: 0

piRSquared

Reputation: 294318

Just a demonstration of the use of pipe after a merge

def f(t):
    return [i for i in t[0] if not i in t[1]]

df1.merge(df2, on='A').pipe(
    lambda d: d[['A']].assign(B=list(map(f, d.drop('A', 1).values)))
)

     A                             B
0  111                  [13, 15, 13]
1  222  [15, 16, 17, 15, 17, 17, 17]
2  333                  [14, 14, 14]
3  444          [26, 12, 12, 12, 13]

Details

# Heavy lifting for differencing
def f(t):
    return [i for i in t[0] if not i in t[1]]

# Merge the same as AmiTavory
# But then I use pipe and assign.  Dbl Brackets to keep single column
# dataframe and assign to create a new B column
# then I use the values from the merge after dropping the A column
df1.merge(df2, on='A').pipe(
    lambda d: d[['A']].assign(B=list(map(f, d.drop('A', 1).values)))
)

Upvotes: 0

penguin2048

Reputation: 1343

You should keep in mind that pandas do not store the list as "actual list" but as an object. You should always try to make the columns with atomic values instead of collections to leverage the full use of pandas capabilities. That being said, to do your desired transformation you can just convert df2's columns to set and remove all those items in corresponding columns of df1.

You need to be sure to convert the "object lists" to "actual list/set" before doing the operations.

Here is the code for this:

Method to apply on df1

def fun(x):
    # Find the list corresponding to the column A of df1 in df2
    # Use indexing to make this step faster
    remove_set = set(df2[df2['A']==x['A']].iloc[0]['B'])
    actual_list = list(x['B'])
    new_list = []
    for i in actual_list:
            if i not in remove_set:
                    new_list.append(i)
    return new_list

calling the method as

df1['B'] = df1.apply(fun, axis=1)

produces the output as

     A                             B
0  111                  [13, 15, 13]
1  222  [15, 16, 17, 15, 17, 17, 17]
2  333                  [14, 14, 14]
3  444          [26, 12, 12, 12, 13]

Note: if you can use indexing on column A then the performance of this code will greatly improve.

Upvotes: 0

Ami Tavory

Reputation: 76297

You can leverage Pandas's merge abilities, and Python's efficient set data structure.

First, merge:

df3 = df1.merge(df2, on='A')

Following that, make df2's items into a set:

df3.B_y = df3.B_y.apply(set)

Now use list comprehension to iterate over df1's items that are not in the set, for each row:

df3['res'] = df3.apply(lambda r: [e for e in r.B_x if e not in r.B_Y], axis=1)

Upvotes: 2

Nikunj masarani

Reputation: 71

You can try this,

 df1["B"]=[list(i for i in df1["B"][j] if i not in df2["B"][j]) for j in range(df1.shape[0])]

Upvotes: 1

How to difference a column in data frame which contains lists?

Answers (5)

Details

Related Questions