Reputation: 4790
I have a CSV file with data like this:
ID|Arr_of_Str
1|["ABC DEF"]
2|["PQR", "ABC DEF"]
I want to read this .csv file; however, when I use sqlContext.read.load, it reads Arr_of_Str as a string.
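For reference, the load probably looks something like this (the file name, delimiter, and header options are assumptions, not from the question):
# assuming an existing SQLContext named sqlContext, as in the question
df = sqlContext.read.load(
    "data.csv",        # hypothetical path
    format="csv",
    sep="|",           # the sample data is pipe-delimited
    header=True,
    inferSchema=True   # this is how ID comes back as integer
)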
Current:
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- Arr_of_Str: string (nullable = true)
Expected:
df.printSchema()
root
|-- ID: integer (nullable = true)
|-- Arr_of_Str: array (nullable = true)
| |-- element: string (containsNull = true)
How can I cast the string to an array of strings?
Upvotes: 3
Views: 3112
Reputation: 32660
Update:
Actually, you can simply use from_json to parse the Arr_of_Str column as an array of strings:
from pyspark.sql import functions as F
df2 = df.withColumn(
"Arr_of_Str",
F.from_json(F.col("Arr_of_Str"), "array<string>")
)
df2.show(truncate=False)
#+---+--------------+
#|ID |Arr_of_Str |
#+---+--------------+
#| 1 |[ABC DEF] |
#| 2 |[PQR, ABC DEF]|
#+---+--------------+
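Calling printSchema on the result should now match the expected schema (assuming ID was inferred as integer at read time):
df2.printSchema()
#root
# |-- ID: integer (nullable = true)
# |-- Arr_of_Str: array (nullable = true)
# |    |-- element: string (containsNull = true)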
Old answer:
You can't do that when reading the data, as there is no support for complex data structures in CSV. You'll have to apply the transformation after loading the DataFrame.
Just remove the square brackets from the string and split it on commas to get an array column.
from pyspark.sql.functions import col, regexp_replace, split
df2 = df.withColumn("Arr_of_Str", split(regexp_replace(col("Arr_of_Str"), '[\\[\\]]', ""), ","))
df2.show()
+---+-------------------+
| ID| Arr_of_Str|
+---+-------------------+
| 1| ["ABC DEF"]|
| 2|["PQR", "ABC DEF"]|
+---+-------------------+
df2.printSchema()
root
|-- ID: string (nullable = true)
|-- Arr_of_Str: array (nullable = true)
| |-- element: string (containsNull = true)
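Note that with this approach the elements still carry the double quotes (and possibly leading spaces) from the original string. If you want clean values, one option (a sketch, not part of the original answer) is to also strip the quotes in the regex and split on the comma plus optional whitespace:
from pyspark.sql.functions import col, regexp_replace, split
# Remove brackets and quotes, then split on "," followed by optional spaces
df3 = df.withColumn(
    "Arr_of_Str",
    split(regexp_replace(col("Arr_of_Str"), '[\\[\\]"]', ""), ",\\s*")
)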
Upvotes: 5