Dataholic
Dataholic

Reputation: 123

Extract a string in between two strings if a sub-string occurs in between those two strings- Pyspark

I have a spark dataframe with two columns (time_stamp and message),as shown below:

Example spark dataframe

      message                                             time_stamp
irrelevant_text Startstring [ID: 1AB]                   2015-01-23 08:23:16
some irrelevant text                                    2015-01-23 08:24:20
irrelevant_text mandatorystring ID [1AB]                2015-01-23 08:25:32
some irrelevant text                                    2015-01-23 08:27:18
irrelevant_text endstring [ID: 1AB]                     2015-01-23 08:30:47
some irrelevant text                                    2015-01-23 08:24:20
irrelevant_text Startstring [ID: 2BC]                   2015-01-23 10:05:16
some irrelevant text                                    2015-01-23 10:24:20
irrelevant_text endstring [ID: 2BC]                     2015-01-23 10:30:47
some irrelevant text                                    2015-01-23 10:50:20
irrelevant_text Startstring [ID: 3DE]                   2015-01-23 12:21:16
some irrelevant text                                    2015-01-23 12:24:20
irrelevant_text mandatorystring ID [3DE]                2015-01-23 12:37:32
some irrelevant text                                    2015-01-23 12:45:18
irrelevant_text endstring [ID: 3DE]                     2015-01-23 12:59:47

I am looking to extract the ID just after Startstring if mandatorystring exists between Startstring and endstring and discarding the IDs if mandatorystring doesn't exists between Startstring and endstring. There may be multiple such instances in one Date.

Expected Output:

time_stamp                    ID
2015-01-23 08:23:16          1AB
2015-01-23 12:21:16          3DE

Help in this regard is highly appreciated. Thanks

Upvotes: 0

Views: 1007

Answers (1)

mck
mck

Reputation: 42352

Clean up the data using some regex and then use lead to check the requirements.

df2 = df.withColumn(
    'id',
    F.regexp_extract('message', 'ID[:|\s][\s|\[]([0-9a-zA-Z]+)\]', 1)
).filter('length(id) > 0').withColumn(
    'messagetype',
    F.when(F.col('message').rlike('Startstring'), 0).when(F.col('message').rlike('endstring'), 1).when(F.col('message').rlike('mandatorystring'), 2)
).drop('message').withColumn(
    'include',
    (F.col('messagetype') == 0) & (F.lead('messagetype').over(Window.partitionBy('id').orderBy('time_stamp')) == 2) & (F.lead('messagetype', 2).over(Window.partitionBy('id').orderBy('time_stamp')) == 1)
).filter('include').select('time_stamp', 'id')

df2.show()
+-------------------+---+
|time_stamp         |id |
+-------------------+---+
|2015-01-23 08:23:16|1AB|
|2015-01-23 12:21:16|3DE|
+-------------------+---+

Upvotes: 1

Related Questions