Reputation: 15350
I have a dataframe that looks like this:
+------------------+
| domain|
+------------------+
| domain.co.uk|
| somedomain.net|
|someotherdomain.fi|
+------------------+
I now want to take the domain ending, look it up in a dictionary that tells me where the domain is located, and add this information as an additional column. In this simple case, my dictionary would look like this:
d = {
'eu': ['fi', 'uk'],
'us': ['net']
}
and hence the dataframe that I would like to have should look like this:
+------------------+--------+
| domain|location|
+------------------+--------+
| domain.co.uk| eu|
| somedomain.net| us|
|someotherdomain.fi| eu|
+------------------+--------+
I first tried to get the substring of the domain after the last dot; in plain Python that would be something like
domain[domain.rfind('.')+1:]
but in PySpark I don't know how to do this dynamically, except by using a udf. However, if I implement even a simple udf, like
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
udf1 = udf(lambda x: x[1:], StringType())
df = df.withColumn('xxx', udf1('domain'))
it hangs (24 hours with no sign of progress, even though it has processed only 0.00001% of the data). I understand that UDFs are expensive. Can I do this without using a udf?
Upvotes: 0
Views: 286
Reputation: 13998
Just use substring_index and convert the dict into a map column:
from pyspark.sql.functions import create_map, lit, substring_index
from itertools import chain
# flatten the dict into a list of (suffix, location) tuples
d_new = [(k, v) for v, ks in d.items() for k in ks]
# [('fi', 'eu'), ('uk', 'eu'), ('net', 'us')]
# build a map column from the flattened list of tuples
mapping = create_map([lit(i) for i in chain.from_iterable(d_new)])
# Column<b'map(fi, eu, uk, eu, net, us)'>
df_new = df.withColumn('location', mapping[substring_index('domain', '.', -1)])
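For reference, here is a minimal end-to-end sketch of the above on the sample data from the question (assuming a local SparkSession named spark; the variable and column names match the question):
from pyspark.sql.functions import create_map, lit, substring_index
from itertools import chain

d = {'eu': ['fi', 'uk'], 'us': ['net']}

# sample data from the question
df = spark.createDataFrame(
    [('domain.co.uk',), ('somedomain.net',), ('someotherdomain.fi',)],
    ['domain']
)

# flatten the dict into (suffix, location) pairs and build a map column
d_new = [(k, v) for v, ks in d.items() for k in ks]
mapping = create_map([lit(i) for i in chain.from_iterable(d_new)])

# look up the part of the domain after the last dot in the map
df_new = df.withColumn('location', mapping[substring_index('domain', '.', -1)])
df_new.show()
# produces the expected dataframe shown in the question
The lookup is done entirely with built-in SQL expressions, so no Python udf is involved. If the dictionary were very large, an alternative would be to put the (suffix, location) pairs into a small DataFrame and broadcast-join it on the suffix.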
Upvotes: 1