BeastlyBernardo
BeastlyBernardo

Reputation: 41

Create a new column if one dataframe's row value is in another data frame's column and get that index

I may be overcomplicating this problem, however I can't seem to find a simple solution.

I have two DataFrame's. Let's call them df1 and df2. To keep things simple. Let's say df1 has one column called "Some Data" and df2 has two columns called "some data" and "other data".

Example:

df1

Some Data "Lebron James 123" "Lebron James 234"

df2

some data                        other data
"Lebron James 123 + other text"  "I want this in df1["New?"]"
"Michael Jordan"                 "Doesn't Matter"

So basically I want to create a new column in df1 called "New?". This new column (in df1) will say "New" if df1["Some data"] is in df2["Some other data"]. However, if there is no instance in df2["some data"], then I set the df1["New?"] to that specific row's value in df2["other data"].

Desired result after running:

df1

Some Data                         New?
"Lebron James 123"  "I want this in df1["New?"]"
"Lebron James 234"               "New"

So as you can see The New? column would include that specific row's value from the other data column. Lebron James 234 isn't anywhere in some data in df2 so it says new.

I am able to get it to say True or False using the .isin() method, however don't know how to grab the index of the other df and get the value from the other data column.

Thank you

EDIT:

From what I know will work

df["New?"] = df1["Some Data"].isin(df2["some data"])

Would render

df1["New?"]

True
False

So I want True to be the "I want this in df1["New?"]" and False to be New

Upvotes: 1

Views: 61

Answers (2)

rafaelc
rafaelc

Reputation: 59274

Based on your info, seems like you need only a simple np.where (if dfs have same length)

df1['New?'] = np.where(df1["Some Data"].isin(df2["some data"]), df2['other data'], 'New')

    Some Data                       New?
0   Lebron James 123 + other text   I want this in df1[New?"]"
1   Lebron James 234                New

For different length,

mask = df2["some data"].isin(df["Some Data"]).values
df.loc[mask,'New'] = df2.loc[mask, 'other data']

df.fillna('New')

Explanation

Basically you have a mask, and you use the same mask to filter both data frames. This yields the same number of results on both dfs given the descriptions, and you assign the filtered rows' "other data" values from df2 to the same matching rows in df "some data"

Upvotes: 1

user3483203
user3483203

Reputation: 51165

First create a regular expression by joining your df1 series:

rgx = '|'.join(df1['some data'])

Now using np.where:

df1.assign(data=np.where(df2['some data'].str.match(rgx), df2['other data'], 'New'))

          some data                        data
0  Lebron James 123  I want this in df1["New?"]
1  Lebron James 234                         New

An example with mismatching shapes:

df1 = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'a': ['aaaaa', 'bbbb', 'ffff', 'gggg', 'hhhh']})

rgx = '({})'.format('|'.join(df1.a))
m = df2.assign(flag=df2.a.str.extract(rgx))

df1.set_index('a').join(m.set_index('flag')).fillna('New').reset_index()

  index      a
0     a  aaaaa
1     b   bbbb
2     c    New
3     d    New

Upvotes: 1

Related Questions