Reputation: 1376
I have a dataframe (with more rows and columns) as shown below.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
What I want:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
I tried to replicate the RDD
solution provided here: Pyspark: Split multiple array columns into rows
(df
.rdd
.flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2)])
.toDF(["col1", "col2", "col3"]))
However, it is not giving the required result
Edit: The explode
option does not work because it is currently stored as string and the explode
function expects an array
Upvotes: 6
Views: 10687
Reputation: 7336
Pault's solution should work perfectly fine although here is another solution which uses regexp_extract
instead (you don't really need to replace anything in this case) and it can handle arbitrary number of spaces:
from pyspark.sql.functions import col, explode, regexp_extract,regexp_replace, split
df.withColumn("col2",
explode(
split(
regexp_extract(
regexp_replace(col("col2"), "\s", ""), "^\[(.*)\]$", 1), ","))) \
.show()
Explanation:
regexp_replace(col("col2"), "\s", "")
will replace all spaces with empty string. regexp_extract
will extract the content of the column which start with [
and ends with ]
.split
for the comma separated values and finally explode
.Upvotes: 2
Reputation: 43544
You can use explode
but first you'll have to convert the string representation of the array into an array.
One way is to use regexp_replace
to remove the leading and trailing square brackets, followed by split
on ", "
.
from pyspark.sql.functions import col, explode, regexp_replace, split
df.withColumn(
"col2",
explode(split(regexp_replace(col("col2"), "(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| z1| a1| foo|
#| z1| b2| foo|
#| z1| c3| foo|
#+----+----+----+
Upvotes: 4