JChat
JChat

Reputation: 814

How to use an equality condition for manipulating a Pandas Dataframe based on another dataframe?

I have a dataframe in Python, say A, which has multiple columns, including columns named ECode and FG. I have another Pandas dataframe B, also with multiple columns, including columns named ECode,F Gping (note the space in column name for F Gping) and EDesc. What I would like to do is to create a new column called EDesc in dataframe A based on following conditions (Note that EDesc, FG and F Gping contain String type values (text), while the remaining columns are numeric/floating type. Also, dataframes A and B are of different dimensions (with differing rows and columns, and I want to check equality in specific values in the dataframe columns):-

  1. First, for all rows in dataframe A, where value in ECode matches value ECode in dataframe B, then, in the new column EDesc to be created in dataframe A, add the same values as EDesc in B.
  2. Secondly, for all rows in dataframe A where value in FG matches F Gping values, in the new column EDesc in A, add same values as EDesc in B.
  3. After this, if the newly created EDesc column in A still has missing values/NaNs, then add the string value MissingValue to all the rows in the Dataframe A's EDesc column.

I have tried using for loops, as well as list comprehensions, but they don't help in accomplishing this. Moreover, the space within column name F Gping in B is created problems to access the same, as though I can access it like B['F Gping'], it isn't solving the very purpose. Any help in this regard is appreciated.

Upvotes: 0

Views: 268

Answers (1)

totooooo
totooooo

Reputation: 1405

I'm assuming values are unique in B['ECode'] and B['F Gping'], otherwise we'll have to choose which value we give to A['EDesc'] when we find two matching values for ECode or FG.

There might be a smarter way but here's what I would do with joins:

Example DataFrames:

A = pd.DataFrame({'ECode': [1, 1, 3, 4, 6],
                  'FG': ['a', 'b', 'c', 'b', 'y']})
B = pd.DataFrame({'ECode': [1, 2, 3, 5],
                  'F Gping': ['b', 'c', 'x', 'x'],
                  'EDesc': ['a', 'b', 'c', 'd']})

So they look like:

A
   ECode FG
0      1  a
1      1  b
2      3  c
3      4  b
4      6  y

B
   ECode F Gping EDesc
0      1       b     a
1      2       c     b
2      3       x     c
3      5       x     d

First let's create A['EDesc'] by saying that it's the result of joining A and B on ECode. We'll temporarily use EDesc as index:

A.set_index('ECode', inplace=True, drop=False)
B.set_index('ECode', inplace=True, drop=False)

A['EDesc'] = A.join(B, lsuffix='A')['EDesc']

This works because the result of A.join(B, lsuffix='A') is:

       ECodeA FG  ECode F Gping EDesc
ECode                                
1           1  a    1.0       b     a
1           1  b    1.0       b     a
3           3  c    3.0       x     c
4           4  b    NaN     NaN   NaN
6           6  y    NaN     NaN   NaN

Now let's fillna on A['EDesc'], using the match on FG. Same thing:

A.set_index('FG', inplace=True, drop=False)
B.set_index('F Gping', inplace=True, drop=False)

A['EDesc'].fillna(A.join(B, lsuffix='A')['EDesc'].drop_duplicates(), inplace=True)

This works because the result of A.join(B, lsuffix='A') is:

   ECodeA FG EDescA  ECode F Gping EDesc
a       1  a      a    NaN     NaN   NaN
b       1  b      a    1.0       b     a
b       4  b    NaN    1.0       b     a
c       3  c      c    2.0       c     b
y       6  y    NaN    NaN     NaN   NaN

Also we dropped the duplicates because as you can see there are two b's in our index.

Finally let's fillna with "Missing" and reset the index:

A['EDesc'].fillna('Missing', inplace=True)
A.reset_index(drop=True, inplace=True)

Upvotes: 1

Related Questions