Reputation: 2050
I have a list of elements that may be the start of some strings recorded in an RDD. If I have an element list of "yes" and "no", they should match "yes23" and "no3" but not "35yes" or "41no". Using PySpark, how can I use startswith to match elements in a list or tuple?
An example DF would be:
+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+
When I try:
Element_List = ['yes', 'no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))
The resulting df should look something like:
+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    3| no123|
+-----+------+
Instead, I get the error:
Py4JError: An error occurred while calling o250.startsWith. Trace: py4j.Py4JException: Method startsWith([class java.util.ArrayList]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745)
It looks like startsWith can't be used with any type of list. Is there a simple workaround?
Upvotes: 13
Views: 38330
Reputation: 24416
When the elements in the list are of the same length, we can do this:
element_list = ['ye', 'no']
df = df.filter(F.col('label')[0:2].isin(element_list))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'yes342'),
     (2, '45yes'),
     (3, 'no123'),
     (4, '75no')],
    ['index', 'label'])
df.show()
# +-----+------+
# |index| label|
# +-----+------+
# |    1|yes342|
# |    2| 45yes|
# |    3| no123|
# |    4|  75no|
# +-----+------+
element_list = ['ye', 'no']
df = df.filter(F.col('label')[0:2].isin(element_list))
df.show()
# +-----+------+
# |index| label|
# +-----+------+
# |    1|yes342|
# |    3| no123|
# +-----+------+
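If the slicing syntax looks unfamiliar: slicing a Column is shorthand for substr, so the same filter can be written more explicitly as below (the same idea, just spelled out; it assumes the same df and element_list as above).
# Equivalent form: take the first two characters (substr is 1-based in Spark SQL)
# and check membership in element_list
df = df.filter(F.col('label').substr(1, 2).isin(element_list))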
Upvotes: 0
Reputation: 96
I feel the best way to achieve this is with a native PySpark function like rlike(). startswith() is meant for filtering on static strings and can't accept dynamic content. If you want to take the keywords from a list dynamically, the best bet is to build a regular expression from the list, as below.
# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^("+ "|".join(li) + ")"
A custom UDF or RDD-level functions might also work, but a custom UDF can hinder performance (see the sketch at the end of this answer).
Below is the complete working example.
# Test dataframe
df = spark.createDataFrame(
    ["yes23", "no3", "35yes"],
    "string"
).toDF("label")
# List
li = ['yes', 'no']
# frame RegEx from the List
# in this case strings starting with yes/no i.e. "^(yes|no)"
reg_str = r"^("+ "|".join(li) + ")"
#Filter dataframe with RegEx
df.filter(df.label.rlike(reg_str)).show()
# +-----+
# |label|
# +-----+
# |yes23|
# |  no3|
# +-----+
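As noted above, a UDF would also work here (Python's str.startswith does accept a tuple), but it runs the check row by row in Python, which is why the native rlike() filter is usually the better choice. Roughly, a UDF version would look like this sketch:
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Row-by-row Python check; works, but bypasses Spark's native optimizations
starts_with_udf = F.udf(lambda s: s is not None and s.startswith(tuple(li)), BooleanType())
df.filter(starts_with_udf("label")).show()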
Upvotes: 3
Reputation: 35249
Compose an expression like this:
from pyspark.sql.functions import col, lit
from functools import reduce
element_list = ['yes', 'no']

df = spark.createDataFrame(
    ["yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

starts_with = reduce(
    lambda x, y: x | y,
    [col("location").startswith(s) for s in element_list],
    lit(False))
df.where(starts_with).show()
# +--------+
# |location|
# +--------+
# |   yes23|
# |     no3|
# +--------+
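A small variation: when element_list is guaranteed to be non-empty, the fold can also be written with operator.or_ and no seed; the lit(False) seed above mainly keeps the expression well-defined (it simply filters out every row) when the list is empty.
import operator
from functools import reduce
from pyspark.sql.functions import col

# Same expression, folded with | via operator.or_ (assumes element_list is non-empty)
starts_with = reduce(operator.or_, [col("location").startswith(s) for s in element_list])
df.where(starts_with).show()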
Upvotes: 25