I would like to filter a column in my PySpark DataFrame using a regular expression. I want to do something like this, but with a regular expression:
newdf = df.filter("only return rows with 8 to 10 characters in column called category")
This is my regular expression:
regex_string = "(\d{8}$|\d{9}$|\d{10}$)"
The category column is of string type.
Upvotes: 1
Try the length() function in Spark.
Example:
from pyspark.sql.functions import col, length
# sample data with strings of length 8-11 characters
df = spark.createDataFrame([('abcdefghij',), ('abcdefghi',), ('abcdefgh',), ('abcdefghijk',)], ['str_col'])
# keep only rows where str_col has 8 to 10 characters
df.filter((length(col("str_col")) >= 8) & (length(col("str_col")) <= 10)).show()
#+----------+
#| str_col|
#+----------+
#|abcdefghij|
#| abcdefghi|
#| abcdefgh|
#+----------+
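If you prefer SQL-style syntax, filter() also accepts an expression string; a minimal equivalent sketch:
# same length filter written as a SQL expression string
df.filter("length(str_col) between 8 and 10").show()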
Using the regex .rlike() function:
df.filter(col("str_col").rlike("^\w{8,10}$")).show()
#+----------+
#| str_col|
#+----------+
#|abcdefghij|
#| abcdefghi|
#| abcdefgh|
#+----------+
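Applied to your category column with your original digit-only pattern, the same approach would look like this (a sketch, assuming category holds digit strings as your regex implies; anchoring with ^...$ matches the whole value, like your $-anchored alternation):
# sketch: keep rows where category is an 8-10 digit string
newdf = df.filter(col("category").rlike(r"^\d{8,10}$"))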
Upvotes: 2