Maria Nazari
Maria Nazari

Reputation: 690

Split Text in Dataframe and Check if Contains Substring

So I want to check if my text contains the word 'baby' and not any other word that contains 'baby'. For example, "maybaby" would not be a match. I already have piece of code that works, but I wanted to see if there was a better way to format so that I don't have to go through the data twice. Here is what I have thus far:

import pyspark.sql.functions as F

rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])

rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')

rows_df = rows_df.withColumn('fruit', split)

+----------+-------------+
|        ID|        fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
|   54-baby|   [54, baby]|
+----------+-------------+

from pyspark.sql.types import StringType
def func(col):
  for item in col:
    if item == "baby":
      return "yes"

  return "no"
func_udf = udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new',func_udf(rows_df['fruit']))

+----------+-------------+---+
|        ID|        fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
|   54-baby|   [54, baby]|yes|
+----------+-------------+---+

Ultimately, I just want the "ID" and "new" column only.

Upvotes: 1

Views: 896

Answers (2)

Kafels
Kafels

Reputation: 4059

I'll show two ways to resolve this. Probably there's a lot other ways to reach the same result.

See the examples below:

from pyspark.shell import sc
from pyspark.sql.functions import split, when

rows = sc.parallelize(
    [
        ['14-banana'], ['12-cheese'], ['13-olives'], 
        ['11-almonds'], ['23-maybaby'], ['54-baby']
    ]
)

# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])

rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()

# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
    'new',
     when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()

# Resolves without forcing `split()[1]` call, avoiding out of index exception
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()

All outputs are the same:

+----------+---+
|        ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
|   54-baby|yes|
+----------+---+

Upvotes: 2

pault
pault

Reputation: 43504

I'd use pyspark.sql.functions.regexp_extract for this. Make the column new equal to "yes" if you're able to extract the word "baby" with a word boundary on both sides, and "no" otherwise.

from pyspark.sql.functions import regexp_extract, when
rows_df.withColumn(
    'new',
    when(
        regexp_extract("ID", "(?<=(\b|\-))baby(?=(\b|$))", 0) == "baby",
        "yes"
    ).otherwise("no")
).show()
#+----------+-------------+---+
#|        ID|        fruit|new|
#+----------+-------------+---+
#| 14-banana| [14, banana]| no|
#| 12-cheese| [12, cheese]| no|
#| 13-olives| [13, olives]| no|
#|11-almonds|[11, almonds]| no|
#|23-maybaby|[23, maybaby]| no|
#|   54-baby|   [54, baby]|yes|
#+----------+-------------+---+

The last argument to regexp_extract is the index of the match to extract. We pick the first index (index 0). If the pattern doesn't match, an empty string is returned. Finally use when() to check if the extracted string equals the desired value.

The regex pattern means:

  • (?<=(\b|\-)): Positive look-behind for either a word boundary (\b) or a literal hyphen (-).
  • baby: The literal word "baby"
  • (?=(\b|$)): Positive look-ahead for either a word boundary or the end of the line ($).

This method also doesn't require you to first split the string, because it's unclear if that part is needed for your purposes.

Upvotes: 1

Related Questions