Pyspark SQL select all rows which matches pattern using LIKE not working

Question

I have this CSV file:

CodiceFiscale;IVA;RAE;SAE
RDCRLR97F07F839F;11111111111;123;123
rdcrlr97f07f839f;11111111111;123;123
errorctrl1 ;34;12;1
error;12;12;1113

And I want to select all the rows where the column codicefiscale follows this pattern: [A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][0-9][0-9][A-Z][0-9][0-9][A-Z][0-9][0-9][0-9][A-Z].

I have made this query on pyspark sql:

SELECT * FROM soggetti WHERE CodiceFiscale LIKE ('[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][0-9][0-9][A-Z][0-9][0-9][A-Z][0-9][0-9][0-9][A-Z]')

But it doesn't work

blackbishop · Accepted Answer

Use this regex : ^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$ with rlike SQL function.

The qualifier {n} mean it matches exactly n times. You need also to add ^ (start of line) and $ (end of line) to match the entire CodiceFiscale :

data = [
    ("RDCRLR97F07F839F", 11111111111, 123, 123),
    ("rdcrlr97f07f839f", 11111111111, 123, 123),
    ("errorctrl1", 34, 12, 1),
    ("error", 12, 12, 1113)
]
df = spark.createDataFrame(data, ["CodiceFiscale", "IVA", "RAE", "SAE"])

df.createOrReplaceTempView("soggetti")

spark.sql("""
    SELECT  * 
    FROM    soggetti 
    WHERE   CodiceFiscale RLIKE '^[A-Z]{6}[0-9]{2}[A-Z][0-9]{2}[A-Z][0-9]{3}[A-Z]$'
""").show() 

#+----------------+-----------+---+---+
#|   CodiceFiscale|        IVA|RAE|SAE|
#+----------------+-----------+---+---+
#|RDCRLR97F07F839F|11111111111|123|123|
#+----------------+-----------+---+---+

Pyspark SQL select all rows which matches pattern using LIKE not working

Answers (2)

Related Questions