Reputation: 357
I'm trying to extract a substring that is delimited by other substrings in PySpark. In the example text below, the desired string is THEVALUEINEED, which is delimited by "meterValue=" and "{". This matters because the string I'm trying to parse contains several values in the same format: "field= THEVALUE {". I therefore need to specify both the opening delimiter and the closing string that follows the desired value.
string = request={meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE...
I tried solving it with re.search (see here), but I cannot make it run in PySpark when I pass a column of strings as the input string.
"message" is the column that holds one such string per row. I want to parse the value out of each row and put it in a new column (newColumn).
df = df.withColumn("newColumn", re.search('meterValue={(.*)}', F.col("message")))
Upvotes: 1
Views: 169
Reputation: 26676
+---+---------------------------------------------------+
|id |message |
+---+---------------------------------------------------+
|1 |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|
|2 |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|
+---+---------------------------------------------------+
Use regexp_extract with a lookbehind and a lookahead to capture the word characters between meterValue= and {:
from pyspark.sql.functions import regexp_extract
df.withColumn('name', regexp_extract('message', r'(?<=meterValue=)(\w+)(?=\{)', 1)).show(truncate=False)
Output:
+---+---------------------------------------------------+-------------+
|id |message |name |
+---+---------------------------------------------------+-------------+
|1 |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|THEVALUEINEED|
|2 |{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE|THEVALUEINEED|
+---+---------------------------------------------------+-------------+
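Since the question mentions several fields in the same "field=VALUE{" format, the pattern can also be built from the field name instead of being hard-coded. The following is only a sketch in plain Python for checking the pattern on a sample string. build_pattern and extract_between are hypothetical helper names, not part of PySpark, and the sketch assumes the value consists of word characters only, as in the example.

```python
import re

def build_pattern(field, closing="{"):
    """Build a regex capturing the word characters between '<field>=' and `closing`.

    Hypothetical helper for illustration; assumes the value matches \\w+.
    """
    # re.escape guards against regex metacharacters in either delimiter.
    return r"(?<=" + re.escape(field + "=") + r")(\w+)(?=" + re.escape(closing) + r")"

def extract_between(text, field, closing="{"):
    """Return the captured value, or an empty string when nothing matches."""
    match = re.search(build_pattern(field, closing), text)
    return match.group(1) if match else ""

message = "{meterValue=THEVALUEINEED{sampledVaLue=ANOTHERVALUE"
print(extract_between(message, "meterValue"))
```

For a DataFrame column, the same pattern string could presumably be passed straight to regexp_extract, for example regexp_extract('message', build_pattern('meterValue'), 1). This assumes the delimiters are simple enough that the escaped form is also valid in the Java regex engine Spark uses. Note that sampledVaLue in the sample row is not followed by "{", so it would not match and an empty string comes back.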
Upvotes: 1