Asan

Reputation: 357

PySpark: Find a substring delimited by multiple characters

I'm trying to extract a substring that is delimited by other substrings in PySpark. In the example text, the desired string is THEVALUEINEED, which is delimited by "meterValue=" and "{". This matters because the string I'm trying to parse contains several values in the same format: "field= THEVALUE {". I therefore need to specify both the opening delimiter and the closing string that follows the desired value.

string = request={meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE...

I tried solving it with re.search (See here) but I cannot make it run in PySpark when I pass a column of strings as the input string.

"message" is the column that contains the string in each row, so I want to parse out the values and put them in a new column (newColumn).

 df = df.withColumn("newColumn", re.search('meterValue={(.*)}', F.col("message")))

Upvotes: 1

Views: 169

Answers (1)

wwnde

Reputation: 26676

+---+---------------------------------------------------+
|id |message                                            |
+---+---------------------------------------------------+
|1  |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|
|2  |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|
+---+---------------------------------------------------+

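You can do exactly that with regexp_extract: capture the word characters between meterValue= and {.

from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('message', r'(?<=meterValue=)(\w+)(?=\{)', 1)).show(truncate=False)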

output

+---+---------------------------------------------------+-------------+
|id |message                                            |name         |
+---+---------------------------------------------------+-------------+
|1  |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|THEVALUEINEED|
|2  |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|THEVALUEINEED|
+---+---------------------------------------------------+-------------+
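
Since the question mentions several fields in the same "field=VALUE{" format, the pattern could also be built from the two delimiters instead of being hard-coded. Below is a minimal sketch, checked here with Python's re module on one sample row; the helper name between_pattern is hypothetical, and it uses a plain capture group with a lazy match rather than the lookarounds above, so it also accepts values containing non-word characters.

```python
import re


def between_pattern(start: str, end: str) -> str:
    """Build a regex that captures the shortest text between two literal delimiters.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # re.escape makes both delimiters literal, so "{" is not parsed as a quantifier.
    return f"{re.escape(start)}(.*?){re.escape(end)}"


# One row of the sample data from the table above.
sample = "{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE"

pattern = between_pattern("meterValue=", "{")
match = re.search(pattern, sample)
# Fall back to an empty string when there is no match,
# which is also what regexp_extract returns in that case.
value = match.group(1) if match else ""
print(value)

# Assumed PySpark usage, with F being pyspark.sql.functions:
# df = df.withColumn("newColumn", F.regexp_extract("message", pattern, 1))
```

The same pattern string should work in regexp_extract because it only uses escaped literals and a lazy group, which Java's regex engine also supports. This is worth verifying against your own data, since the full message format is not shown in the question.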

Upvotes: 1
