Reputation: 27
I hope someone could help me solving this python scripting. I'd like to do something similar to index-match function in excel using Python.
I have two data sets shown below: -
a) The dataframe 1 (df):
Index | Node 1 | Node 2 | Value |
---|---|---|---|
0 | ABBDD | DSGHSH | 164.35 |
1 | ABNEQY | ASFCZC | 125.40 |
2 | ABNERR | ZDCDBB | 150.40 |
3 | BANDER | DSVBDG | 0.00 |
4 | GDCHR | CVXSVG | 94.05 |
5 | GFRSA | ZFGDHJD | 94.05 |
6 | SRHNAI | SDWRWW | 119.70 |
7 | ZDCDBB | BANDERR | 148.20 |
8 | VBZ3VG | EWEWWE | 148.20 |
9 | SDZXZC | ADGSGSG | 125.40 |
b) The dataframe 2 (df2):
Index | Node | Value |
---|---|---|
0 | SDZX | |
1 | VBZ3 | |
2 | AAAF | |
3 | ADGS | |
4 | DSVB | |
5 | ZDCD | |
6 | GFRSA | |
7 | ABNE | |
8 | ABBD | |
9 | EWEW | |
10 | SRHNA | |
11 | BANDER |
The first data frame comprises of two longer node names and a value column which I'd like to copy to the second data frame if its shorter node name contains in either one of the node names in the first data frame. If more than one value is available, only the maximum value will be selected.
My desired output:
Index | Node | Value |
---|---|---|
0 | SDZX | 125.40 |
1 | VBZ3 | 148.20 |
2 | AAAF | NaN |
3 | ADGS | 125.40 |
4 | DSVB | 0.00 |
5 | ZDCD | 150.40 |
6 | GFRSA | 94.05 |
7 | ABNE | 150.40 |
8 | ABBD | 164.35 |
9 | EWEW | 148.20 |
10 | SRHNA | 119.70 |
11 | BANDER | 148.20 |
I have tried several ways. First step, I reduce the Node 1 and Node 2 name to 5 characters. I understand this is not the right way as this implies that I'll miss out those Nodes with more than 5 characters. I can't figure out how to use str.contains in join/merge function in Pandas to solve this problem.
df['Node 1'] = df['Node 1'].str.slice(0,5)
df['Node 2'] = df['Node 2'].str[:5]
In the second step, I wrote the following code. It can seem to do the job but looks quite inefficient. I wonder if someone has any suggestion to improve the scripts. Basically, I split the steps into two. I compare the node names in df2 with Node 1 and then Node 2. Ideally I'd like to be able to compare two nodes in df with df2 at once.
match_node1 = df2.reset_index().set_index('Node').join(df.set_index(['Node 1']))
match_node1 = match_node1.sort_values('Value',ascending=True).drop_duplicates(subset='index', keep='last').set_index('index')
match_node2 = df2.reset_index().set_index('Node').join(df.set_index(['Node 2']))
match_node2 = match_node2.sort_values('Value',ascending=True).drop_duplicates(subset='index', keep='last').set_index('index')
Any help/advice will be greatly appreciated.
Upvotes: 0
Views: 343
Reputation: 914
I think this should work, now:
I started from scratch because you're only providing a tables not coded data simple.
Creating DataFrames
import pandas as pd
dicty1 = {'Node1':['ABNEQY', 'ABNERR', 'VBZ3VG'],
'Node2': ['ADGSGSG', 'EWEWWE', 'DSVBDG'],
'Value': [125.40, 150.40, 0.00]
}
dicty2 = {'Node':['ABNE', 'ABNE','VBZ3', 'ADGS', 'EWEW', 'DSVB'],
'value': None
}
df1 = pd.DataFrame(dicty1)
df2 = pd.DataFrame(dicty2)
Data Frame 1
Node1 | Node2 | Value |
---|---|---|
ABNEQY | ADGSGSG | 125.4 |
ABNERR | EWEWWE | 150.4 |
VBZ3VG | DSVBDG | 0.0 |
Data Frame 2
Node | value |
---|---|
ABNE | None |
ABNE | None |
VBZ3 | None |
ADGS | None |
EWEW | None |
DSVB | None |
Now, I had to concatenate both columns [Node1, Node2]
of df1
so, I can deal with them when I merge the dataframes.
cols = [df1['Node1'], df1['Node2']]
conc_cols = pd.concat(cols, names='Node', ignore_index= True) # This's a 'pandas Series' I've to convert it into a 'pandas DataFrame'
rename = conc_cols.rename('new_Node')
new_df = pd.DataFrame(rename)
new_df Dataframe
new_Node |
---|
ABNEQY |
ABNERR |
VBZ3VG |
ADGSGSG |
EWEWWE |
DSVBDG |
Now, I need to concatenate it with its values so, I've to duplicate the value
column to equalize the values of new_df.
new_df['duplicates_values'] = pd.concat([df1['Value'], df1['Value']], ignore_index=True)
new_df["new_Node"] = new_df['new_Node'].astype(str).str[:4]
Let's remove the duplicates from both df2 and new_df:
df2.drop_duplicates('Node',inplace=True)
new_df.drop_duplicates('new_Node',inplace=True, keep='last') #TODO change happens here
Then, pandas.merge():
df3 = pd.merge(df2, new_df, how= 'left', left_on = 'Node', right_on = 'new_Node')
DataFrame3
Node | value | new_Node | duplicates_ values |
---|---|---|---|
ABNE | None | ABNE | 150.4 |
VBZ3 | None | VBZ3 | 0.0 |
ADGS | None | ADGS | 125.4 |
EWEW | None | EWEW | 150.4 |
DSVB | None | DSVB | 0.0 |
Now, we need to select the duplicates_ values
and convert it into a list
then insert it into df2
-> value
df2['value'] = df3['duplicates_ values'].tolist()
OUTPUT Of DataFrame 2
Node | value |
---|---|
ABNE | 150.4 |
VBZ3 | 0.0 |
ADGS | 125.4 |
EWEW | 150.4 |
DSVB | 0.0 |
Upvotes: 2