Reputation: 123
I have a Spark dataframe with two columns (time_stamp and message), as shown below:
Example Spark dataframe:
message time_stamp
irrelevant_text Startstring [ID: 1AB] 2015-01-23 08:23:16
some irrelevant text 2015-01-23 08:24:20
irrelevant_text mandatorystring ID [1AB] 2015-01-23 08:25:32
some irrelevant text 2015-01-23 08:27:18
irrelevant_text endstring [ID: 1AB] 2015-01-23 08:30:47
some irrelevant text 2015-01-23 08:24:20
irrelevant_text Startstring [ID: 2BC] 2015-01-23 10:05:16
some irrelevant text 2015-01-23 10:24:20
irrelevant_text endstring [ID: 2BC] 2015-01-23 10:30:47
some irrelevant text 2015-01-23 10:50:20
irrelevant_text Startstring [ID: 3DE] 2015-01-23 12:21:16
some irrelevant text 2015-01-23 12:24:20
irrelevant_text mandatorystring ID [3DE] 2015-01-23 12:37:32
some irrelevant text 2015-01-23 12:45:18
irrelevant_text endstring [ID: 3DE] 2015-01-23 12:59:47
I am looking to extract the ID just after Startstring if mandatorystring exists between Startstring and endstring, and to discard the ID if mandatorystring doesn't exist between them. There may be multiple such instances in one day.
Expected Output:
time_stamp ID
2015-01-23 08:23:16 1AB
2015-01-23 12:21:16 3DE
Help in this regard is highly appreciated. Thanks
Upvotes: 0
Views: 1007
Reputation: 42352
Clean up the data using some regex and then use lead to check the requirements.
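The extraction pattern can be sanity-checked outside Spark with Python's re module (regexp_extract uses Java regex, but this simple pattern behaves identically in both; extract_id is a hypothetical helper for illustration):

```python
import re

# Matches both "[ID: 1AB]" (colon form) and "ID [1AB]" (bracket form),
# capturing the alphanumeric ID.
pattern = re.compile(r'ID[:\s][\s\[]([0-9A-Za-z]+)\]')

def extract_id(message):
    m = pattern.search(message)
    return m.group(1) if m else ''

print(extract_id('irrelevant_text Startstring [ID: 1AB]'))    # 1AB
print(extract_id('irrelevant_text mandatorystring ID [1AB]'))  # 1AB
print(extract_id('some irrelevant text'))                      # '' (no match)
```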
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy('time_stamp')

df2 = df.withColumn(
    'id',
    # capture the ID from either "[ID: 1AB]" or "ID [1AB]"
    F.regexp_extract('message', r'ID[:\s][\s\[]([0-9A-Za-z]+)\]', 1)
).filter('length(id) > 0').withColumn(
    'messagetype',
    F.when(F.col('message').rlike('Startstring'), 0)
     .when(F.col('message').rlike('endstring'), 1)
     .when(F.col('message').rlike('mandatorystring'), 2)
).drop('message').withColumn(
    'include',
    # keep a Startstring row only if, per ID in time order, the next row
    # is mandatorystring and the one after it is endstring
    (F.col('messagetype') == 0) &
    (F.lead('messagetype').over(w) == 2) &
    (F.lead('messagetype', 2).over(w) == 1)
).filter('include').select('time_stamp', 'id')
df2.show(truncate=False)
+-------------------+---+
|time_stamp |id |
+-------------------+---+
|2015-01-23 08:23:16|1AB|
|2015-01-23 12:21:16|3DE|
+-------------------+---+
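For intuition, the lead-based check can be simulated in plain Python (a hypothetical sketch of the logic, not the Spark code): group the cleaned rows by ID, sort each group by timestamp, and keep a Startstring (type 0) only when the next two events are mandatorystring (2) then endstring (1).

```python
from collections import defaultdict

# (time_stamp, id, messagetype) tuples after the regex/filter step;
# 0 = Startstring, 1 = endstring, 2 = mandatorystring
rows = [
    ('2015-01-23 08:23:16', '1AB', 0),
    ('2015-01-23 08:25:32', '1AB', 2),
    ('2015-01-23 08:30:47', '1AB', 1),
    ('2015-01-23 10:05:16', '2BC', 0),
    ('2015-01-23 10:30:47', '2BC', 1),
    ('2015-01-23 12:21:16', '3DE', 0),
    ('2015-01-23 12:37:32', '3DE', 2),
    ('2015-01-23 12:59:47', '3DE', 1),
]

groups = defaultdict(list)              # partitionBy('id')
for ts, id_, mtype in rows:
    groups[id_].append((ts, mtype))

result = []
for id_, events in groups.items():
    events.sort()                       # orderBy('time_stamp')
    types = [t for _, t in events]
    for i, (ts, t) in enumerate(events):
        # lead(col, n) == peek at the n-th following row in the partition
        if t == 0 and types[i + 1:i + 3] == [2, 1]:
            result.append((ts, id_))

print(sorted(result))
# [('2015-01-23 08:23:16', '1AB'), ('2015-01-23 12:21:16', '3DE')]
```

2BC is dropped because its Startstring is followed directly by endstring, with no mandatorystring in between.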
Upvotes: 1