Niels

Reputation: 153

PySpark: Extract multiple sequences of numeric characters out of string

I am trying to extract multiple sequences of numerical characters out of a long string. With regexp_extract, I am able to extract the first sequence, but how can I extract the following ones and write them into new columns (one column for every sequence)?

df.withColumn('IDENT1', F.regexp_extract(df.IDENTIFIER, r"(\d+)", 0))
+-----------------------------------------------------------------------------+----------+
|IDENTIFIER                                                                   |IDENT1    |                                                                                                                                  
+-----------------------------------------------------------------------------+----------+
|31010833 - 9911075145 - some more random text                                |31010833  |
|TL/9910900047/asdiqwoidw/more text// 23231263264                             |9910900047|
+-----------------------------------------------------------------------------+----------+

Upvotes: 0

Views: 533

Answers (1)

vladsiv

Reputation: 2936

I would use a UDF to extract all the numbers into an array, and then either explode that array or select each element as its own column.

Example:

import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType


def extract(text):
    all_matches = re.findall(r"-?\d(?:[,\d]*\.\d+|[,\d]*)", text)
    return all_matches


extract_udf = F.udf(extract, ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()
data = [
    {"test": "31010833 - 9911075145 - some more random text"},
    {"test": "TL/9910900047/asdiqwoidw/more text// 23231263264 "},
]
df = spark.createDataFrame(data=data)
df = df.withColumn("numbers", extract_udf("test"))
df = df.select(["test"] + [df.numbers[i] for i in range(2)])
df.show(10, False)
df.printSchema()

Result:

+-------------------------------------------------+----------+-----------+      
|test                                             |numbers[0]|numbers[1] |
+-------------------------------------------------+----------+-----------+
|31010833 - 9911075145 - some more random text    |31010833  |9911075145 |
|TL/9910900047/asdiqwoidw/more text// 23231263264 |9910900047|23231263264|
+-------------------------------------------------+----------+-----------+

root
 |-- test: string (nullable = true)
 |-- numbers[0]: string (nullable = true)
 |-- numbers[1]: string (nullable = true)

Also, if you are on Spark 3.1+, you can use regexp_extract_all instead of a UDF. See this question: Extracting several regex matches in PySpark

Upvotes: 2
