Reputation: 523
I have the following code:
from pyspark.sql import functions as F

L = {'L1': ['us']}
#df1 = df1.withColumnRenamed("name","OriginalCompanyName")
for key, vals in L.items():
    # regex pattern for extracting vals
    pat = r'\\b(%s)\\b' % '|'.join(vals)
    # extract matching occurrences
    col1 = F.expr("regexp_extract_all(array_join(loc, ' '), '%s')" % pat)
    # Mask the rows with null when there are no matches
    df1 = df1.withColumn(key, F.when((F.size(col1) == 0), None).otherwise(col1))
It extracts us from the column loc: the key column contains us when there is a match and null otherwise. Some rows also have an empty list [] in the column loc, and I want to put us in the key column for those rows as well. Changing L = {'L1': ['us']} to L = {'L1': ['us', '[]']} doesn't work.
For some reason this code actually eliminates rows when loc is empty. Can I modify the code?
Hint: rows with an empty loc can be found with the following code:
df1 = df1.withColumn('empty_country', F.when(F.size('loc') == 0, 'us'))
Data sample:
loc
["this is ,us, better life"]
["no one is, in charge"]
["I am, very far, from us"]
[]
Expected output:
loc                              L1
["this is ,us, better life"]     ["us"]
["no one is, in charge"]         null
["I am, very far, from us"]      ["us"]
[]                               ["us"]
Upvotes: 0
Views: 41
Reputation: 1857
Make this change to the last line in the for loop:
df1 = df1.withColumn(key, F.when((F.size(col1) == 0) & (F.size('loc') != 0), None).when(F.size('loc') == 0, F.array(F.lit('us'))).otherwise(col1))
PS: The output of regexp_extract_all is an array, which is why the fallback value us is wrapped in an array as well.
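To sanity-check the expected output without a Spark session, the per-row rule can be modelled in plain Python. This is only a rough sketch: extract_key is a hypothetical helper, not a Spark function, and Python's re module stands in for the Java regex engine that Spark uses. It also shows why adding '[]' to the list cannot help: joining an empty list gives an empty string, so there is no text for any pattern to match.

```python
import re


def extract_key(loc, vals, default="us"):
    """Plain-Python model of the per-row rule (hypothetical helper, not a Spark API)."""
    if len(loc) == 0:
        # Mirrors .when(F.size('loc') == 0, F.array(F.lit('us')))
        return [default]
    # Mirrors array_join(loc, ' ')
    text = " ".join(loc)
    # Same pattern shape as in the question: whole-word match on any of vals
    pat = r"\b(%s)\b" % "|".join(vals)
    # Mirrors regexp_extract_all, which returns the first capture group by default
    matches = re.findall(pat, text)
    # Mirrors the null mask for non-empty rows without a match
    return matches if matches else None


rows = [
    ["this is ,us, better life"],
    ["no one is, in charge"],
    ["I am, very far, from us"],
    [],
]

results = [extract_key(loc, ["us"]) for loc in rows]
print(results)

# An empty loc joins to an empty string, so nothing can ever be extracted from it
print(repr(" ".join([])))
```

The four results line up with the L1 column the question asks for: a match, null, a match, and the default for the empty list.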
Upvotes: 1