Reputation: 5074
I am running PySpark v1.6.0, and I have a column of string values (according to .printSchema). When I attempt to filter the rows on whether the column value starts with a "[" character or contains a "," character, the rows that I expect to evaluate to True still evaluate to False...
When I run the code:
col_name = "attempt_params_attempt_response_id"
resultDF.select(col_name, resultDF[col_name].like(",")).show(50)
I get:
I don't understand how this is possible, because the string value clearly contains a comma, so that row should return true, not false.
Similarly, when I try casting the rows to ArrayType(StringType())
(which is my ultimate goal), it also behaves as if my rows don't contain a comma...
When I run the code:
from pyspark.sql.functions import split
from pyspark.sql.types import ArrayType, IntegerType, StringType

col_name = "attempt_params_attempt_response_id"
resultDF.withColumn(col_name,
                    split(resultDF[col_name], r",\s*")
                    .cast(ArrayType(StringType()))).select(col_name).show(40)
I get the results:
I wonder if perhaps there's some sort of bizarre encoding issue that causes the "," character not to match the character that appears in the data to be a ","... but I really am not sure. Any ideas on why this is happening and how I can actually get the cast to work without creating the text of a multi-dimensional array?
Upvotes: 1
Views: 815
Reputation: 330093
In the first case your pattern is wrong. like is equivalent to SQL LIKE, which uses a simple pattern language (% and _ wildcards), not a regular expression, so a bare "," matches only a string that is exactly ",".
from pyspark.sql.functions import col

df = spark.createDataFrame([("[0,2,3]", ), (",", )], ("text", ))
df.withColumn("contains_comma", col("text").like(",")).show()
+-------+--------------+
| text|contains_comma|
+-------+--------------+
|[0,2,3]| false|
| ,| true|
+-------+--------------+
To get a match you should add leading and trailing wildcards:
df.withColumn("contains_comma", col("text").like("%,%")).show()
# +-------+--------------+
# | text|contains_comma|
# +-------+--------------+
# |[0,2,3]| true|
# | ,| true|
# +-------+--------------+
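The LIKE semantics shown above can be checked without a Spark session by translating the pattern into a regular expression. A minimal sketch in plain Python (the sql_like helper is illustrative, not part of Spark):

```python
import re

def sql_like(value: str, pattern: str) -> bool:
    # Translate SQL LIKE wildcards: % -> .*, _ -> . ; everything else is literal
    regex = "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    )
    # LIKE matches the whole string, hence fullmatch rather than search
    return re.fullmatch(regex, value) is not None

print(sql_like("[0,2,3]", ","))    # False: bare "," only matches the exact string ","
print(sql_like("[0,2,3]", "%,%"))  # True: wildcards turn it into a substring test
```

This makes it easy to see why like(",") returned false for "[0,2,3]" while like("%,%") returns true.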
In the second case there is no problem at all. Since you split on ",", the first item will contain a leading "["
from pyspark.sql.functions import split

df.withColumn("contains_comma", split("text", r",\s*")[0]).show()
+-------+--------------+
| text|contains_comma|
+-------+--------------+
|[0,2,3]| [0|
| ,| |
+-------+--------------+
and the last item a trailing "]". You should strip these if you don't want them in the output, for example using regexp_replace:

split(regexp_replace("text", r"^\[|\]$", ""), ",")
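The row-wise effect of that regexp_replace + split combination can be mimicked in plain Python, which is handy for sanity-checking the patterns before running them on a cluster. A small sketch (the parse_bracketed_list helper is illustrative, not part of Spark; empty fragments are dropped as a simplification):

```python
import re

def parse_bracketed_list(text: str) -> list:
    # Mirror of regexp_replace("text", "^\[|\]$", ""): strip one leading "["
    # and one trailing "]"
    stripped = re.sub(r"^\[|\]$", "", text)
    # Mirror of split(..., ",\s*"); drop empty fragments for simplicity
    return [x for x in re.split(r",\s*", stripped) if x]

print(parse_bracketed_list("[0, 2,3]"))  # ['0', '2', '3']
```

The resulting list of strings corresponds to what the ArrayType(StringType()) cast in the question would produce once the brackets are stripped first.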
Upvotes: 1