I would like to filter a column in my PySpark DataFrame using a regular expression. I want to do something like this, but with a regular expression:
newdf = df.filter("only return rows with 8 to 10 characters in column called category")
This is my regular expression:
regex_string = "(\d{8}$|\d{9}$|\d{10}$)"
The category column is of string type.
Upvotes: 1
Try the length() function in Spark.
Example:
from pyspark.sql.functions import col, length
# sample data with strings of length 8-11 characters
df = spark.createDataFrame([('abcdefghij',), ('abcdefghi',), ('abcdefgh',), ('abcdefghijk',)], ['str_col'])
# keep only rows where str_col has 8 to 10 characters
df.filter((length(col("str_col")) >= 8) & (length(col("str_col")) <= 10)).show()
#+----------+
#| str_col|
#+----------+
#|abcdefghij|
#| abcdefghi|
#| abcdefgh|
#+----------+
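If you prefer SQL-style syntax, filter() also accepts an expression string; a minimal equivalent sketch:
# same length filter written as a SQL expression string
df.filter("length(str_col) between 8 and 10").show()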
Using the regex .rlike() function:
df.filter(col("str_col").rlike("^\w{8,10}$")).show()
#+----------+
#| str_col|
#+----------+
#|abcdefghij|
#| abcdefghi|
#| abcdefgh|
#+----------+
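Applied to your category column with your original digit-only pattern, the same approach would look like this (a sketch, assuming category holds digit strings as your regex implies; anchoring with ^...$ matches the whole value, like your $-anchored alternation):
# sketch: keep rows where category is an 8-10 digit string
newdf = df.filter(col("category").rlike(r"^\d{8,10}$"))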
Upvotes: 2