user1848018

Reputation: 1106

How to use multiple regex patterns using rlike in pyspark

I have to filter a large file using multiple patterns, but I am not sure of an efficient way to apply them with rlike. As an example:

df = spark.createDataFrame(
    [
        ('www 17 north gate',),
        ('aaa 45 north gate',),
        ('bbb 56 west gate',),
        ('ccc 56 south gate',),
        ('Michigan gate',),
        ('Statue of Liberty',),
        ('57 adam street',),
        ('19 west main street',),
        ('street burger',)
    ],
    ['poi']
)

df.show()
+-------------------+
|                poi|
+-------------------+
|  www 17 north gate|
|  aaa 45 north gate|
|   bbb 56 west gate|
|  ccc 56 south gate|
|      Michigan gate|
|  Statue of Liberty|
|     57 adam street|
|19 west main street|
|      street burger|
+-------------------+

If I use the following two patterns on the data, I could do

pat1="(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2="[0-9]+ [a-z\s]+ street$"
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
|              poi|
+-----------------+
|www 17 north gate|
|    Michigan gate|
|Statue of Liberty|
|    street burger|
+-----------------+

What if I have 40 different patterns? I guess I could use a loop like this:

for pat in [pat1,pat2,....,patn]:
    df = df.filter(~df['poi'].rlike(pat))

Is this the right approach? The original data is in Chinese, so please ignore whether the patterns themselves are efficient; I am just interested in how to handle multiple regex patterns.

Upvotes: 9

Views: 25333

Answers (1)

pault

Reputation: 43544

Both approaches you suggested have the same execution plan:

Using two patterns in succession:

df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ && 
#         NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]

Using a loop:

from functools import reduce

# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ && 
#         NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
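The reduce call folds each pattern into a successive filter, exactly like the explicit loop. The same fold can be sanity-checked outside Spark on a plain Python list with the `re` module (a minimal sketch, using the sample patterns and a few of the sample rows from the question):

```python
import re
from functools import reduce

pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"

rows = ["aaa 45 north gate", "Michigan gate", "57 adam street", "street burger"]

# Fold: each step keeps only the rows that do NOT match the current pattern,
# mirroring df.filter(~df['poi'].rlike(pat)) applied once per pattern.
# re.search (not re.match) is used because rlike matches anywhere in the string.
remaining = reduce(
    lambda rs, pat: [r for r in rs if not re.search(pat, r)],
    [pat1, pat2],
    rows,
)
print(remaining)  # only the rows matching neither pattern survive
```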

An alternative approach is to combine all of your patterns into one by chaining them with "|".join(), the regex OR operator. The main difference is that this results in a single call to rlike (as opposed to one call per pattern with the other method):

df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
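When joining many patterns this way, it can be safer to wrap each one in a non-capturing group `(?:...)` so that alternation inside an individual pattern cannot bleed into its neighbours. A quick standalone check with Python's `re` module (a sketch, not Spark; the sample strings are the rows of the DataFrame above, and `re.search` stands in for rlike's substring matching):

```python
import re

pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"

# Wrap each pattern in a non-capturing group before joining with the
# regex OR operator, so precedence inside each pattern is preserved.
combined = "|".join("(?:{})".format(p) for p in [pat1, pat2])

rows = [
    "www 17 north gate", "aaa 45 north gate", "bbb 56 west gate",
    "ccc 56 south gate", "Michigan gate", "Statue of Liberty",
    "57 adam street", "19 west main street", "street burger",
]

# Keep only the rows that match none of the patterns.
kept = [r for r in rows if not re.search(combined, r)]
print(kept)
```

With these two particular patterns the grouping makes no difference (both end in `$`, and `|` has the lowest precedence), but with 40 heterogeneous patterns the explicit `(?:...)` wrapping avoids surprises.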

Upvotes: 9
