devinbost

Reputation: 5074

PySpark value with comma doesn't contain comma?? (Trying to cast to ArrayType(StringType()))

I am running PySpark v1.6.0, and I have a column of string values (according to .printSchema). When I attempt to filter the rows where the column value starts with a "[" character or contains a "," character, in both cases the rows that I expect to evaluate to True are reported as False...

When I run the code:

col_name = "attempt_params_attempt_response_id"
resultDF.select(col_name, resultDF[col_name].like(",")).show(50)

I get:

[Image: column value contains a comma, but the match column shows false]

I don't understand how this is possible because the string value clearly contains a comma, so that row should return true, not false.

Similarly, when I try casting the rows to ArrayType(StringType()) (which is my ultimate goal), it also behaves as if my rows don't contain a comma...

When I run the code:

from pyspark.sql.functions import split
from pyspark.sql.types import ArrayType, IntegerType, StringType

col_name = "attempt_params_attempt_response_id"
resultDF.withColumn(col_name,
                    split(resultDF[col_name], ",\s*")
                    .cast(ArrayType(StringType()))).select(col_name).show(40)

I get the results:

[Image: multidimensional array, as if there was no comma to split on]

I wonder if perhaps there's some sort of bizarre encoding issue that causes the , character in my pattern not to match the character that appears to be a , in the data... but I really am not sure. Any ideas on why this is happening and how I can actually get the cast to work without producing the text of a multi-dimensional array?
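One sanity check (purely a hypothetical diagnostic, not part of my original code) would be to print the Unicode code points of a few values, to rule out lookalike characters such as the fullwidth comma (U+FF0C):

# hypothetical check: inspect the code points of the first few values
# to rule out lookalike characters such as the fullwidth comma (U+FF0C)
for row in resultDF.select(col_name).take(3):
    print([hex(ord(c)) for c in row[0]])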

Upvotes: 1

Views: 815

Answers (1)

zero323

Reputation: 330093

In the first case your pattern is wrong. like is equivalent to SQL LIKE, which uses simple patterns (% matches any sequence of characters, _ matches a single character), so the pattern "," matches only the literal string "," and nothing else.

from pyspark.sql.functions import col

df = spark.createDataFrame([("[0,2,3]", ), (",", )], ("text", ))
df.withColumn("contains_comma", col("text").like(",")).show()

# +-------+--------------+
# |   text|contains_comma|
# +-------+--------------+
# |[0,2,3]|         false|
# |      ,|          true|
# +-------+--------------+

To get a match you should add leading and trailing wildcards:

df.withColumn("contains_comma", col("text").like("%,%")).show()


# +-------+--------------+
# |   text|contains_comma|
# +-------+--------------+
# |[0,2,3]|          true|
# |      ,|          true|
# +-------+--------------+
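If you'd rather not build a LIKE pattern, a substring test works as well. This is just an alternative sketch, assuming your version exposes Column.contains and Column.rlike: contains does a literal substring check, while rlike treats its argument as a regular expression and, unlike like, matches anywhere in the string. Both return true for both rows here:

# literal substring test
df.withColumn("contains_comma", col("text").contains(",")).show()

# regular-expression test; matches anywhere in the string
df.withColumn("contains_comma", col("text").rlike(",")).show()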

In the second case there is no problem at all. Since you split on ",", the first item will contain a leading [

df.withColumn("contains_comma", split("text", ",\s*")[0]).show()
+-------+--------------+
|   text|contains_comma|
+-------+--------------+
|[0,2,3]|            [0|
|      ,|              |
+-------+--------------+

and the last item a trailing ]. You should strip these if you don't want them in the output, for example using regexp_replace:

split(regexp_replace("text", r"^\[|\]$", ""), ",")
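Putting it together for the original column (a minimal sketch using the resultDF and column name from your question; note that split already returns an array&lt;string&gt;, so the extra cast to ArrayType(StringType()) is redundant):

from pyspark.sql.functions import regexp_replace, split

col_name = "attempt_params_attempt_response_id"

# "[0,2,3]" -> "0,2,3" -> ["0", "2", "3"]
resultDF = resultDF.withColumn(
    col_name,
    split(regexp_replace(resultDF[col_name], r"^\[|\]$", ""), r",\s*"))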

Upvotes: 1
