PySpark - Using lists inside LIKE operator

Question

I would like to use a list inside the LIKE operator in PySpark in order to create a column.

I have the following input DataFrame, input_df:

+------+--------------------+-------+
|    ID|           customers|country|
+------+--------------------+-------+
|161   |xyz Limited         |U.K.   |
|262   |ABC  Limited        |U.K.   |
|165   |Sons & Sons         |U.K.   |
|361   |TÜV GmbH            |Germany|
|462   |Mueller GmbH        |Germany|
|369   |Schneider AG        |Germany|
|467   |Sahm UG             |Austria|
+------+--------------------+-------+

I would like to add a column CAT_ID. CAT_ID takes the value 1 if ID starts with "16" or "26", and the value 2 if ID starts with "36" or "46". The desired output_df looks like this:

+------+--------------------+-------+-------+
|    ID|           customers|country|CAT_ID |
+------+--------------------+-------+-------+
|161   |xyz Limited         |U.K.   |1      |
|262   |ABC  Limited        |U.K.   |1      |
|165   |Sons & Sons         |U.K.   |1      |
|361   |TÜV GmbH            |Germany|2      |
|462   |Mueller GmbH        |Germany|2      |
|369   |Schneider AG        |Germany|2      |
|467   |Sahm UG             |Austria|2      |
+------+--------------------+-------+-------+

I am interested in learning how this can be done using the LIKE operator and lists.

I know how to implement it without a list, and this works perfectly:

from pyspark.sql import functions as F

def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID', 
        F.when( ( (F.col('ID').like('16%')) | (F.col('ID').like('26%'))  ) , "1") \
         .when( ( (F.col('ID').like('36%')) | (F.col('ID').like('46%'))  ) , "2") \
         .otherwise('999')
    )


output_df = add_CAT_ID(input_df)

However, I would love to use lists and have something like this (pseudocode, not valid Python):

list1 =['16', '26']
list2 =['36', '46']


def add_CAT_ID(df):
    return df.withColumn(
        'CAT_ID', 
        F.when( ( (F.col('ID').like(list1 %))  ) , "1") \
         .when( ( (F.col('ID').like('list2 %'))  ) , "2") \
         .otherwise('999')
    )


output_df = add_CAT_ID(input_df)

Thanks a lot in advance,

MaFF · Accepted Answer

A LIKE pattern cannot express alternatives ("or"), so a single pattern cannot cover several prefixes. There are several ways you can handle this, though.

1. Regular expressions

You can use rlike with a regular expression:

import pyspark.sql.functions as psf

list1 =['16', '26'] 
list2 =['36', '46']
df.withColumn(
        'CAT_ID', 
        # Raw strings keep the backslash in \d from being treated as a Python escape
        psf.when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list1))), '1') \
            .when(psf.col('ID').rlike(r'({})\d'.format('|'.join(list2))), '2') \
            .otherwise('999')) \
    .show()

        +---+------------+-------+------+
        | ID|   customers|country|CAT_ID|
        +---+------------+-------+------+
        |161| xyz Limited|   U.K.|     1|
        |262|ABC  Limited|   U.K.|     1|
        |165| Sons & Sons|   U.K.|     1|
        |361|    TÜV GmbH|Germany|     2|
        |462|Mueller GmbH|Germany|     2|
        |369|Schneider AG|Germany|     2|
        |467|     Sahm UG|Austria|     2|
        +---+------------+-------+------+

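Here, list1 yields the regular expression (16|26)\d, which matches 16 or 26 followed by a digit (\d is equivalent to [0-9]). Note that rlike matches the pattern anywhere in the string, so to mirror LIKE '16%' exactly you can anchor it at the start: ^(16|26).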
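If you build the pattern from a list often, a small helper keeps it readable. The following is only a sketch: prefix_pattern is a hypothetical name, and it uses Python's re.escape on the assumption that the escaped prefixes are also valid in the Java regex dialect Spark uses (true for plain digits and letters, not checked for every special character). The check at the end uses Python's re module just to illustrate what the pattern matches.

```python
import re

def prefix_pattern(prefixes):
    """Build a regex matching strings that start with any of the prefixes.

    Hypothetical helper, not part of PySpark.
    """
    # Escape each prefix so characters like '.' are matched literally.
    alternatives = '|'.join(re.escape(p) for p in prefixes)
    # '^' anchors the match at the start, mirroring LIKE 'prefix%'.
    return '^({})'.format(alternatives)

list1 = ['16', '26']
pattern1 = prefix_pattern(list1)
print(pattern1)  # ^(16|26)

# Illustration with Python's own regex engine:
print(bool(re.search(pattern1, '161')))   # True
print(bool(re.search(pattern1, '1161')))  # False
```

The resulting string can then be passed to rlike, e.g. psf.col('ID').rlike(prefix_pattern(list1)).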

2. Dynamically build an SQL clause

If you want to keep the SQL LIKE syntax, you can use selectExpr and chain the conditions with ' OR ':

df.selectExpr(
        '*', 
        "CASE WHEN ({}) THEN '1' WHEN ({}) THEN '2' ELSE '999' END AS CAT_ID"
            .format(*[' OR '.join(["ID LIKE '{}%'".format(x) for x in prefixes]) for prefixes in [list1, list2]]))
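The one-liner above is dense, so here is a sketch of the same idea split into a helper that only builds the SQL string. case_when_like is a hypothetical name, and the function assumes the prefixes and labels are trusted values, since they are interpolated into the SQL text without any escaping.

```python
def case_when_like(column, categories, default):
    """Return a SQL CASE expression mapping groups of LIKE prefixes to labels.

    Hypothetical helper: `categories` maps a label to a list of prefixes.
    Values are interpolated as-is, so only pass trusted input.
    """
    branches = []
    for label, prefixes in categories.items():
        # One "col LIKE 'prefix%'" test per prefix, joined with OR.
        condition = ' OR '.join(
            "{} LIKE '{}%'".format(column, p) for p in prefixes
        )
        branches.append("WHEN ({}) THEN '{}'".format(condition, label))
    return "CASE {} ELSE '{}' END".format(' '.join(branches), default)

expr = case_when_like('ID', {'1': ['16', '26'], '2': ['36', '46']}, '999')
print(expr)
```

You would then use it as df.selectExpr('*', expr + ' AS CAT_ID'). Keeping the categories in a dict means adding a third group does not require touching the format string.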

3. Dynamically build a Python expression

You can also use eval if you don't want to write SQL (only with values you control, since eval executes arbitrary code):

df.withColumn(
        'CAT_ID', 
        psf.when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list1])), '1')
            .when(eval(" | ".join(["psf.col('ID').like('{}%')".format(x) for x in list2])), '2')
            .otherwise('999'))
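A variant that avoids eval is to combine the individual like conditions with functools.reduce and the | operator. This is a sketch rather than part of the original answer: like_any is a hypothetical helper, and it only assumes that the object passed in has a like method and supports |, which is the case for a PySpark Column.

```python
from functools import reduce
from operator import or_

def like_any(column, prefixes):
    """OR together column.like('<prefix>%') for every prefix.

    Hypothetical helper; `column` is expected to be a PySpark Column
    (anything with a .like() method whose results support `|`).
    Raises TypeError if `prefixes` is empty, because reduce has no
    initial value to fall back on.
    """
    return reduce(or_, [column.like(p + '%') for p in prefixes])
```

With the lists from the question this would be used as psf.when(like_any(psf.col('ID'), list1), '1').when(like_any(psf.col('ID'), list2), '2').otherwise('999'), which builds the same condition as the hand-written version without going through a string.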
