Reputation: 183
I have 2 dfs
df_1
Nº.do Incidente Status Description Per_Extracao
0 IN6948271 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822... DE : 2022/01/05 ATÉ : 2022/12/08
1 IN6948304 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ... DE : 2022/01/05 ATÉ : 2022/12/08
2 IN6948307 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX... DE : 2022/01/05 ATÉ : 2022/12/08
3 IN6948309 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX... DE : 2022/01/05 ATÉ : 2022/12/08
4 IN6948310 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX... DE : 2022/01/05 ATÉ : 2022/12/08
5 IN6948311 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_ADAT_SPRK_EX... DE : 2022/01/05 ATÉ : 2022/12/08
df_2
JOB_NAME JOB_STREAM_NAME
0 PP_AACD_NR_D8706_TIHIBRIDA_PROC_EXCUC_D P26_AACD_FAC_TOD
1 PP_SASG_GD9822 P26_AACE_U08
2 PP_AACE_R4539 P26_AACE_U09
3 PP_AACE_R4539_CONS_JUNC P26_AACE_U08
4 PP_AACE_R4539_FMRC_TD_01 P26_AACE_U08
5 PP_AACE_R4539_FMRC_TD_02 P26_AACE_U08
I'm trying to merge then based on the value of JOB_NAME in df_2
the output should be something like this
merged_df
Nº.do Incidente Status Description Per_Extracao JOB_NAME JOB_STREAM_NAME
0 IN6948271 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822... DE : 2022/01/05 ATÉ : 2022/12/08 PP_SASG_GD9822 P26_AACE_U08
1 IN6948304 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ... DE : 2022/01/05 ATÉ : 2022/12/08 PP_AACE_R4539 P26_AACE_U09
its not a regular join, its a contains contains condition("JOB_NAME" value in df_2 founded in "Description" column of df_1).
could you guys help me please?
Upvotes: 0
Views: 70
Reputation: 7903
Based on this answer we can merge your dfs with some little changes.
pat = "|".join(map(str, df2['JOB_NAME']))
df1.insert(0, 'merge_key', df1['Description'].str.extract("("+pat+")", expand=False))
df = df1.merge(df2, left_on='merge_key', right_on='JOB_NAME').drop('merge_key', axis=1)
print(df)
Nº.do Incidente Status Description jobName Per_Extracao JOB_NAME JOB_STREAM_NAME
0 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_SASG_GD9822... NaN DE : 2022/01/05 ATÉ : 2022/12/08 PP_SASG_GD9822 P26_AACE_U08
1 ENCERRADO GR26 D.I.T.I. >>> ABEND NO JOB PP_AACE_R4539 ... NaN DE : 2022/01/05 ATÉ : 2022/12/08 PP_AACE_R4539 P26_AACE_U09
Explanation:
The variable pat contains one big string with all jobnames we are searching for, seperated by |
.
Then we insert a new column called merge_key
at the 1st position (which doesn't really matter here, it could also be the last column) with str.extraxt
, which searchs in the column df1['description']
in every row, if there is a match with any name of all job names in pat
and if there is a match, it adds that match to the new column (we seperated them with |
because in regex the |
stands for or
, the ()
are needed to make it caputered groups (see the documentation) for more details on regular expressions.
With the new column we can merge them two df on merge_key
and JOB_NAME
At the end we drop that column merge_key
since we only needed it for the merging.
Upvotes: 2
Reputation: 551
The easiest way is:
- Imagine this is your first csv file ::
So as you can see these 2 data is like yours on pattern , there are only 2 rows on Sensor_2
column that are equal to the other file . ( check 1st and 2nd row on both tables)
now what we can do is :
import pandas as pd
df1 = pd.read_excel("myData1.xlsx")
df1.head()
df2 = pd.read_excel("myData2.xlsx")
df2.head()
df_merged = df1.merge(df2,on=["Sensor_2"],suffixes=('', '_drop'))
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)
print(df_merged.head())
and the result is :
Sensor_1 Sensor_2 Sensor_3 Year
0 10 1 1.5 2020
1 11 2 2.0 2020
Upvotes: -1