Reputation: 690
So I want to check if my text contains the word 'baby' and not any other word that contains 'baby'. For example, "maybaby" would not be a match. I already have piece of code that works, but I wanted to see if there was a better way to format so that I don't have to go through the data twice. Here is what I have thus far:
import pyspark.sql.functions as F
rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])
rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')
rows_df = rows_df.withColumn('fruit', split)
+----------+-------------+
| ID| fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
| 54-baby| [54, baby]|
+----------+-------------+
from pyspark.sql.types import StringType
def func(col):
for item in col:
if item == "baby":
return "yes"
return "no"
func_udf = udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new',func_udf(rows_df['fruit']))
+----------+-------------+---+
| ID| fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
| 54-baby| [54, baby]|yes|
+----------+-------------+---+
Ultimately, I just want the "ID" and "new" column only.
Upvotes: 1
Views: 896
Reputation: 4059
I'll show two ways to resolve this. Probably there's a lot other ways to reach the same result.
See the examples below:
from pyspark.shell import sc
from pyspark.sql.functions import split, when
rows = sc.parallelize(
[
['14-banana'], ['12-cheese'], ['13-olives'],
['11-almonds'], ['23-maybaby'], ['54-baby']
]
)
# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])
rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()
# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
'new',
when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()
# Resolves without forcing `split()[1]` call, avoiding out of index exception
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()
All outputs are the same:
+----------+---+
| ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
| 54-baby|yes|
+----------+---+
Upvotes: 2
Reputation: 43504
I'd use pyspark.sql.functions.regexp_extract
for this. Make the column new
equal to "yes"
if you're able to extract the word "baby"
with a word boundary on both sides, and "no"
otherwise.
from pyspark.sql.functions import regexp_extract, when
rows_df.withColumn(
'new',
when(
regexp_extract("ID", "(?<=(\b|\-))baby(?=(\b|$))", 0) == "baby",
"yes"
).otherwise("no")
).show()
#+----------+-------------+---+
#| ID| fruit|new|
#+----------+-------------+---+
#| 14-banana| [14, banana]| no|
#| 12-cheese| [12, cheese]| no|
#| 13-olives| [13, olives]| no|
#|11-almonds|[11, almonds]| no|
#|23-maybaby|[23, maybaby]| no|
#| 54-baby| [54, baby]|yes|
#+----------+-------------+---+
The last argument to regexp_extract
is the index of the match to extract. We pick the first index (index 0
). If the pattern doesn't match, an empty string is returned. Finally use when()
to check if the extracted string equals the desired value.
The regex pattern means:
(?<=(\b|\-))
: Positive look-behind for either a word boundary (\b
) or a literal hyphen (-
).baby
: The literal word "baby"
(?=(\b|$))
: Positive look-ahead for either a word boundary or the end of the line ($
).This method also doesn't require you to first split the string, because it's unclear if that part is needed for your purposes.
Upvotes: 1