Reputation: 91
i'm new to python programming. I have two dataframes df1 containing tags (180k rows) and df2 containing equipment names (1600 rows)
df1:
Line TagName CLASS 187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL 10 187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK 10 187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy 10 187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV 10 187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy 10
df2:
EquipmentNo EquipmentDescription Equipment
1311256 Lifting table LA120
1311257 Roller bed RB2200
1311258 Lifting table LT2202
1311259 Roller bed RB2202
1311260 Roller bed RB2204
The df2.Equipment is somewhere in between the strings in df1.TagName. i need to match based on if df2 Equipment is in df1 Tagname then df2 (Equipment description and Equipment no ) has to be matched with df1.
Final output should be
Line TagName quipmentdescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
I have tried this now
cols= df2['Equipment'].tolist()
Xs=[]
for i in cols:
Test = df1.loc[df1.TagName.str.contains(i)]
Test['Equip']=i
Xs.append(Test)
then merge xs and df2 based on "Equipment"
But i am getting this error
first argument must be string or compiled pattern
Upvotes: 1
Views: 2795
Reputation: 62513
df1
and df2
as follows:df1
| | Line | TagName | CLASS |
|---:|:-------|:--------------------------------|--------:|
| 0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | 10 |
| 1 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK | 10 |
| 2 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy | 10 |
| 3 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV | 10 |
| 4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | 10 |
df2
| | EquipmentNo | EquipmentDescription | Equipment |
|---:|--------------:|:-----------------------|:------------|
| 0 | 1311256 | Lifting table | LA120 |
| 1 | 1311257 | Roller bed | RB2200 |
| 2 | 1311258 | Lifting table | LT2202 |
| 3 | 1311259 | Roller bed | RB2202 |
| 4 | 1311260 | Roller bed | RB2204 |
equipment
from Equipment
in df2
equipment = df2.Equipment.unique().tolist()
Equipment
column in df1
by finding a match in equipment
df1['Equipment'] = df1['TagName'].apply(lambda x: ''.join([part for part in equipment if part in x]))
Equipment
into final form
Equipment
column in df_final
, add .drop(columns=['Equipment'])
to the end of the next line of code.df_final = df1[['Line', 'TagName', 'Equipment']].merge(df2, on='Equipment')
df_final
| | Line | TagName | Equipment | EquipmentNo | EquipmentDescription |
|---:|:-------|:--------------------------------|:------------|--------------:|:-----------------------|
| 0 | PT_WOA | .ZS01_LA120_T05.SB.S2384_LesSwL | LA120 | 1311256 | Lifting table |
| 1 | PT_WOA | .ZS01_LA120_T05.SB._CBAbsHy | LA120 | 1311256 | Lifting table |
| 2 | PT_WOA | .ZS01_LA120_T05.SB.S3110_CBAPV | LA120 | 1311256 | Lifting table |
| 3 | PT_WOA | .ZS01_RB2202_T05.SB.S2385_FLOK | RB2202 | 1311259 | Roller bed |
| 4 | PT_WOA | .ZS01_LARB2204.SB.S3111_CBRelHy | RB2204 | 1311260 | Roller bed |
Upvotes: 1
Reputation: 2657
I would do something like this:
Create a new column indexes
where for every Equipment
in df2 find a list of Indexes in df1 where df1.TagName contains the Equipment
.
Flatten the indexes
by creating one row for each item using stack()
and reset_index()
from io import StringIO
import numpy as np
import pandas as pd
df1=StringIO("""Line;TagName;CLASS
187877;PT_WOA;.ZS01_LA120_T05.SB.S2384_LesSwL;10
187878;PT_WOA;.ZS01_RB2202_T05.SB.S2385_FLOK;10
187879;PT_WOA;.ZS01_LA120_T05.SB._CBAbsHy;10
187880;PT_WOA;.ZS01_LA120_T05.SB.S3110_CBAPV;10
187881;PT_WOA;.ZS01_LARB2204.SB.S3111_CBRelHy;10""")
df2=StringIO("""EquipmentNo;EquipmentDescription;Equipment
1311256;Lifting table;LA120
1311257;Roller bed;RB2200
1311258;Lifting table;LT2202
1311259;Roller bed;RB2202
1311260;Roller bed;RB2204""")
df1=pd.read_csv(df1,sep=";")
df2=pd.read_csv(df2,sep=";")
df2['indexes'] = df2['Equipment'].apply(lambda x: df1.index[df1.TagName.str.contains(str(x)).tolist()].tolist())
indexes = df2.apply(lambda x: pd.Series(x['indexes']),axis=1).stack().reset_index(level=1, drop=True)
indexes.name = 'indexes'
df2 = df2.drop('indexes', axis=1).join(indexes).dropna()
df2.index = df2['indexes']
matches = df2.join(df1, how='inner')
print(matches[['Line','TagName','EquipmentDescription','EquipmentNo']])
OUTPUT:
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
Upvotes: 1
Reputation: 3030
Initializing the provided dataframes:
import numpy as np
import pandas as pd
df1 = pd.DataFrame([['PT_WOA', '.ZS01_LA120_T05.SB.S2384_LesSwL', 10],
['PT_WOA', '.ZS01_RB2202_T05.SB.S2385_FLOK', 10],
['PT_WOA', '.ZS01_LA120_T05.SB._CBAbsHy', 10],
['PT_WOA', '.ZS01_LA120_T05.SB.S3110_CBAPV', 10],
['PT_WOA', '.ZS01_LARB2204.SB.S3111_CBRelHy', 10]],
columns = ['Line', 'TagName', 'CLASS'],
index = [187877, 187878, 187879, 187880, 187881])
df2 = pd.DataFrame([[1311256, 'Lifting table', 'LA120'],
[1311257, 'Roller bed', 'RB2200'],
[1311258, 'Lifting table', 'LT2202'],
[1311259, 'Roller bed', 'RB2202'],
[1311260, 'Roller bed', 'RB2204']],
columns = ['EquipmentNo', 'EquipmentDescription', 'Equipment'])
I would suggest the following:
# create a copy of df1, dropping the 'CLASS' column
df3 = df1.drop(columns=['CLASS'])
# add the columns 'EquipmentDescription' and 'Equipment' filled with numpy NaN's
df3['EquipmentDescription'] = np.nan
df3['EquipmentNo'] = np.nan
# for each row in df3, iterate over each row in df2
for index_df3, row_df3 in df3.iterrows():
for index_df2, row_df2 in df2.iterrows():
# check if 'Equipment' is in 'TagName'
if df2.loc[index_df2, 'Equipment'] in df3.loc[index_df3, 'TagName']:
# set 'EquipmentDescription' and 'EquipmentNo'
df3.loc[index_df3, 'EquipmentDescription'] = df2.loc[index_df2, 'EquipmentDescription']
df3.loc[index_df3, 'EquipmentNo'] = df2.loc[index_df2, 'EquipmentNo']
# conver the 'EquipmentNo' to type int
df3['EquipmentNo'] = df3['EquipmentNo'].astype(int)
This yields the following dataframe:
Line TagName EquipmentDescription EquipmentNo
187877 PT_WOA .ZS01_LA120_T05.SB.S2384_LesSwL Lifting table 1311256
187878 PT_WOA .ZS01_RB2202_T05.SB.S2385_FLOK Roller bed 1311259
187879 PT_WOA .ZS01_LA120_T05.SB._CBAbsHy Lifting table 1311256
187880 PT_WOA .ZS01_LA120_T05.SB.S3110_CBAPV Lifting table 1311256
187881 PT_WOA .ZS01_LARB2204.SB.S3111_CBRelHy Roller bed 1311260
Let me know if this helps.
Upvotes: 1