Split Text in Dataframe and Check if Contains Substring

Question

I want to check whether my text contains the word 'baby' as a whole word, not as part of another word that contains 'baby'. For example, "maybaby" should not be a match. I already have a piece of code that works, but I would like to know if there is a better way to write it so that I don't have to go through the data twice. Here is what I have so far:

import pyspark.sql.functions as F

rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])

rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')

rows_df = rows_df.withColumn('fruit', split)

+----------+-------------+
|        ID|        fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
|   54-baby|   [54, baby]|
+----------+-------------+

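from pyspark.sql.types import StringType

def func(col):
    # Return "yes" only when one of the split tokens equals "baby" exactly
    for item in col:
        if item == "baby":
            return "yes"
    return "no"

# `udf` is reached through the `F` alias imported above
func_udf = F.udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new', func_udf(rows_df['fruit']))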

+----------+-------------+---+
|        ID|        fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
|   54-baby|   [54, baby]|yes|
+----------+-------------+---+

Ultimately, I want only the "ID" and "new" columns.
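
To make the requirement concrete, here is a minimal plain-Python sketch (no Spark involved) of the difference between a substring test and an exact token test. The helper name `has_token` is hypothetical and used only for illustration.

```python
def has_token(text, word, sep="-"):
    """Return True when `word` equals one of the `sep`-separated tokens of `text`."""
    return word in text.split(sep)

ids = ["14-banana", "23-maybaby", "54-baby"]

# Substring test: also matches "maybaby", which is not wanted here
substring_hits = [i for i in ids if "baby" in i]

# Token test: matches only when a whole token equals "baby"
token_hits = [i for i in ids if has_token(i, "baby")]

print(substring_hits)
print(token_hits)
```

The token test is the behavior the UDF above implements on the split column.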

Kafels · Accepted Answer

I'll show three ways to resolve this. There are probably many other ways to reach the same result.

See the examples below:

from pyspark.shell import sc
from pyspark.sql.functions import split, udf, when

rows = sc.parallelize(
    [
        ['14-banana'], ['12-cheese'], ['13-olives'], 
        ['11-almonds'], ['23-maybaby'], ['54-baby']
    ]
)

# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])

rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()

# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
    'new',
    when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()

# Resolves with a UDF that checks every token instead of indexing `split()[1]`, so IDs that contain no '-' are handled safely
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()

All outputs are the same:

+----------+---+
|        ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
|   54-baby|yes|
+----------+---+
