Reputation: 73
I am trying to extract the number between the string “line_number:” and the hyphen that follows it, but I am struggling to write a regex or substring expression for this in PySpark. Below is my input data, in a column called “whole_text”. Every row contains the string “line_number:”, followed by the number and a hyphen. Is there a way to find the text “line_number:” and the first hyphen after it, and extract the number in between?
The output should be 121, 3112 and so on in a new column.
Please help.
text:ABC12637-XYZ line_number:121-ABC:JJ11
header:3AXYZ166-LMN line_number:3112-GHI:3A1
Upvotes: 0
Views: 291
Reputation: 373
Some minimal example code would be useful to replicate your problem.
Here is how I'd solve this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("""
text:ABC12637-XYZ line_number:121-ABC:JJ11
header:3AXYZ166-LMN line_number:3112-GHI:3A1
""",)], ['str'])
df.select("str", F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)").alias('extracted')).show()
Which produces:
+--------------------+-----------+
| str| extracted|
+--------------------+-----------+
|\ntext:ABC12637-X...|[121, 3112]|
+--------------------+-----------+
Update:
df.withColumn('extracted_regex', F.expr(r"regexp_extract_all(str, r'line_number:(\d+)-', 1)")).show()
+--------------------+---------------+
| str|extracted_regex|
+--------------------+---------------+
|\ntext:ABC12637-X...| [121, 3112]|
+--------------------+---------------+
Using Python 3.12 and Spark 3.5
>>> spark.version
'3.5.0'
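Note that the example above puts both sample lines into a single row, which is why the result is an array. If each row of “whole_text” holds exactly one “line_number:” entry, as the question describes, regexp_extract (which returns one string per row) is probably the better fit. Below is a minimal sketch under that assumption. The helper names extract_line_number and with_line_number and the output column name line_number are hypothetical, chosen only for illustration. The plain-Python part checks the pattern against the sample rows without needing a Spark session.

```python
import re

# Literal "line_number:", then a captured run of digits, then the first hyphen.
# The same pattern works in Python's re and in Spark's Java-based regex engine.
LINE_NUMBER_PATTERN = r"line_number:(\d+)-"


def extract_line_number(text):
    """Return the digits between 'line_number:' and the next hyphen, or '' if absent."""
    match = re.search(LINE_NUMBER_PATTERN, text)
    return match.group(1) if match else ""


def with_line_number(df, source_col="whole_text", target_col="line_number"):
    """Hypothetical helper: add target_col to a PySpark DataFrame using regexp_extract."""
    # Imported lazily so the pattern check below runs even without Spark installed.
    import pyspark.sql.functions as F

    return df.withColumn(
        target_col, F.regexp_extract(F.col(source_col), LINE_NUMBER_PATTERN, 1)
    )


# Quick check of the pattern against the sample rows from the question.
samples = [
    "text:ABC12637-XYZ line_number:121-ABC:JJ11",
    "header:3AXYZ166-LMN line_number:3112-GHI:3A1",
]
extracted = [extract_line_number(s) for s in samples]
print(extracted)  # ['121', '3112']
```

With a DataFrame that has one sample line per row, with_line_number(df).show() should add a string column containing 121 and 3112. Rows without a match get an empty string from regexp_extract, and you can append .cast("int") if you need a numeric column.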
Upvotes: 1