Maxiloy
Maxiloy

Reputation: 27

Index-match/MaxIF function using Python

I hope someone could help me solving this python scripting. I'd like to do something similar to index-match function in excel using Python.

I have two data sets shown below: -

a) The dataframe 1 (df):

Index Node 1 Node 2 Value
0 ABBDD DSGHSH 164.35
1 ABNEQY ASFCZC 125.40
2 ABNERR ZDCDBB 150.40
3 BANDER DSVBDG 0.00
4 GDCHR CVXSVG 94.05
5 GFRSA ZFGDHJD 94.05
6 SRHNAI SDWRWW 119.70
7 ZDCDBB BANDERR 148.20
8 VBZ3VG EWEWWE 148.20
9 SDZXZC ADGSGSG 125.40

b) The dataframe 2 (df2):

Index Node Value
0 SDZX
1 VBZ3
2 AAAF
3 ADGS
4 DSVB
5 ZDCD
6 GFRSA
7 ABNE
8 ABBD
9 EWEW
10 SRHNA
11 BANDER

The first data frame comprises of two longer node names and a value column which I'd like to copy to the second data frame if its shorter node name contains in either one of the node names in the first data frame. If more than one value is available, only the maximum value will be selected.

My desired output:

Index Node Value
0 SDZX 125.40
1 VBZ3 148.20
2 AAAF NaN
3 ADGS 125.40
4 DSVB 0.00
5 ZDCD 150.40
6 GFRSA 94.05
7 ABNE 150.40
8 ABBD 164.35
9 EWEW 148.20
10 SRHNA 119.70
11 BANDER 148.20

I have tried several ways. First step, I reduce the Node 1 and Node 2 name to 5 characters. I understand this is not the right way as this implies that I'll miss out those Nodes with more than 5 characters. I can't figure out how to use str.contains in join/merge function in Pandas to solve this problem.

df['Node 1'] = df['Node 1'].str.slice(0,5)
df['Node 2'] = df['Node 2'].str[:5]

In the second step, I wrote the following code. It can seem to do the job but looks quite inefficient. I wonder if someone has any suggestion to improve the scripts. Basically, I split the steps into two. I compare the node names in df2 with Node 1 and then Node 2. Ideally I'd like to be able to compare two nodes in df with df2 at once.

match_node1 = df2.reset_index().set_index('Node').join(df.set_index(['Node 1']))
match_node1 = match_node1.sort_values('Value',ascending=True).drop_duplicates(subset='index', keep='last').set_index('index')
match_node2 = df2.reset_index().set_index('Node').join(df.set_index(['Node 2']))
match_node2 = match_node2.sort_values('Value',ascending=True).drop_duplicates(subset='index', keep='last').set_index('index')

Any help/advice will be greatly appreciated.

Upvotes: 0

Views: 343

Answers (1)

Ahmed
Ahmed

Reputation: 914

I think this should work, now:

I started from scratch because you're only providing a tables not coded data simple.

Creating DataFrames

import pandas as pd
dicty1 = {'Node1':['ABNEQY', 'ABNERR', 'VBZ3VG'],
                'Node2': ['ADGSGSG', 'EWEWWE', 'DSVBDG'],
                'Value': [125.40, 150.40,  0.00]
            }
dicty2 = {'Node':['ABNE', 'ABNE','VBZ3', 'ADGS', 'EWEW', 'DSVB'],
                'value': None
              }

df1 = pd.DataFrame(dicty1)
df2 = pd.DataFrame(dicty2) 

Data Frame 1

Node1 Node2 Value
ABNEQY ADGSGSG 125.4
ABNERR EWEWWE 150.4
VBZ3VG DSVBDG 0.0

Data Frame 2

Node value
ABNE None
ABNE None
VBZ3 None
ADGS None
EWEW None
DSVB None

Now, I had to concatenate both columns [Node1, Node2] of df1 so, I can deal with them when I merge the dataframes.

cols = [df1['Node1'], df1['Node2']]
conc_cols = pd.concat(cols, names='Node', ignore_index= True)   # This's a 'pandas Series' I've to convert it into a 'pandas DataFrame'
rename = conc_cols.rename('new_Node')
new_df = pd.DataFrame(rename)

new_df Dataframe

new_Node
ABNEQY
ABNERR
VBZ3VG
ADGSGSG
EWEWWE
DSVBDG

Now, I need to concatenate it with its values so, I've to duplicate the value column to equalize the values of new_df.

new_df['duplicates_values'] = pd.concat([df1['Value'], df1['Value']], ignore_index=True)
new_df["new_Node"] = new_df['new_Node'].astype(str).str[:4] 

Let's remove the duplicates from both df2 and new_df:

df2.drop_duplicates('Node',inplace=True)
new_df.drop_duplicates('new_Node',inplace=True, keep='last') #TODO change happens here

Then, pandas.merge():

df3 =  pd.merge(df2, new_df, how= 'left', left_on = 'Node', right_on = 'new_Node')

DataFrame3

Node value new_Node duplicates_ values
ABNE None ABNE 150.4
VBZ3 None VBZ3 0.0
ADGS None ADGS 125.4
EWEW None EWEW 150.4
DSVB None DSVB 0.0

Now, we need to select the duplicates_ values and convert it into a list then insert it into df2 -> value

df2['value'] = df3['duplicates_ values'].tolist()

OUTPUT Of DataFrame 2

Node value
ABNE 150.4
VBZ3 0.0
ADGS 125.4
EWEW 150.4
DSVB 0.0

Upvotes: 2

Related Questions