Reputation: 15350
I have a dataframe that looks like this:
+------------------+
| domain|
+------------------+
| domain.co.uk|
| somedomain.net|
|someotherdomain.fi|
+------------------+
I now want to take the domain ending, look it up in a dictionary that tells me where the domain is located, and add this information as an additional column. In this simple case, my dictionary would look like this:
d = {
'eu': ['fi', 'uk'],
'us': ['net']
}
and hence the dataframe that I would like to have should look like this:
+------------------+--------+
| domain|location|
+------------------+--------+
| domain.co.uk| eu|
| somedomain.net| us|
|someotherdomain.fi| eu|
+------------------+--------+
I first tried to get the substring of the domain after the last dot; in plain Python that would be something like
domain[domain.rfind('.')+1:]
but in PySpark I don't know how to do this dynamically, except by using a udf. However, if I implement even a simple udf, like
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udf1 = udf(lambda x: x[1:], StringType())
df = df.withColumn('xxx', udf1('domain'))
it hangs (24 hours with no sign of progress, even though it has processed only 0.00001% of the data). I understand that UDFs are expensive. Can I do this without using a udf?
Upvotes: 0
Views: 286
Reputation: 13998
Just use substring_index and convert the dict into a map column:
from pyspark.sql.functions import create_map, lit, substring_index
from itertools import chain
# flatten the dict into a list of (suffix, location) tuples
d_new = [(k, v) for v, ks in d.items() for k in ks]
# [('fi', 'eu'), ('uk', 'eu'), ('net', 'us')]
# build a map column from the flattened list of tuples
mapping = create_map([lit(i) for i in chain.from_iterable(d_new)])
# Column<b'map(fi, eu, uk, eu, net, us)'>
df_new = df.withColumn('location', mapping[substring_index('domain', '.', -1)])
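For reference, here is a minimal end-to-end sketch of the above on the sample data from the question (assuming a local SparkSession named spark; the variable and column names match the question):
from pyspark.sql.functions import create_map, lit, substring_index
from itertools import chain

d = {'eu': ['fi', 'uk'], 'us': ['net']}

# sample data from the question
df = spark.createDataFrame(
    [('domain.co.uk',), ('somedomain.net',), ('someotherdomain.fi',)],
    ['domain']
)

# flatten the dict into (suffix, location) pairs and build a map column
d_new = [(k, v) for v, ks in d.items() for k in ks]
mapping = create_map([lit(i) for i in chain.from_iterable(d_new)])

# look up the part of the domain after the last dot in the map
df_new = df.withColumn('location', mapping[substring_index('domain', '.', -1)])
df_new.show()
# produces the expected dataframe shown in the question
The lookup is done entirely with built-in SQL expressions, so no Python udf is involved. If the dictionary were very large, an alternative would be to put the (suffix, location) pairs into a small DataFrame and broadcast-join it on the suffix.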
Upvotes: 1