Josh

Reputation: 768

How can I take a dataframe containing lists of strings and create another dataframe from these lists in PySpark?

Suppose I have a dataframe that looks like this:

+--------------------+
|        ColA        |
+--------------------+
| [val1, val2, val3] |
+--------------------+
| [val4, val5, val6] |
+--------------------+
| [val7, val8, val9] |
+--------------------+

How can I create a new dataframe that would look like this?

+------+------+------+
| Col1 | Col2 | Col3 |
+------+------+------+
| val1 | val2 | val3 |
+------+------+------+
| val4 | val5 | val6 |
+------+------+------+
| val7 | val8 | val9 |
+------+------+------+

Upvotes: 0

Views: 265

Answers (2)

abiratsis

Reputation: 7316

Here are some options, using either map through the RDD API or a single select expression.

First, let's create some sample data and extract the column names from the first row of the dataset. The precondition here is that all the arrays in the dataset must have the same length:

from pyspark.sql import Row

df = spark.createDataFrame(
[[["val1", "val2", "val3"]],
[["val4", "val5", "val6"]],
[["val7", "val8", "val9"]]], ["ColA"])

# get the length of the 1st item; it should be the same for all the items in the dataset
ar_len = len(df.first()["ColA"])

# generate col names
col_names = ["col" + str(i + 1) for i in range(0, ar_len)]

col_names
# ['col1', 'col2', 'col3']

Option1: Map + Row

def to_row(l):
  # Row(*col_names) defines the schema (field names) of the Row
  r = Row(*col_names)

  # fill the Row defined above with the values of the array
  r = r(*l[0])
  return r

df.rdd.map(to_row).toDF().show()

You should first declare the col_names list, which must be the same size as each array item. Then Row(*col_names) creates the desired schema of the Row. Finally, r(*l[0]) fills the previously created Row with the values of the list.
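For reference, here is a minimal sketch of how this two-step Row construction behaves on plain Python values (the hard-coded names and values are just for illustration):

from pyspark.sql import Row

r = Row("col1", "col2", "col3")     # a Row "template" holding only the field names
print(r("val1", "val2", "val3"))    # Row(col1='val1', col2='val2', col3='val3')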

Option2: Map + tuples

df.rdd.map(lambda l: (*l[0],)).toDF(col_names).show()

Here we simply unpack all the items of the list into a new tuple.
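To make the unpacking concrete, this is roughly what the lambda produces for a single row (using a hand-built Row just for illustration):

from pyspark.sql import Row

row = Row(ColA=["val1", "val2", "val3"])
print((*row[0],))   # ('val1', 'val2', 'val3') - a plain tuple that toDF(col_names) can consume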

Option3: select statement

import pyspark.sql.functions as f

cols = [f.col('ColA').getItem(i).alias(c) for i,c in enumerate(col_names)]

df.select(*cols).show()

Output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
|val4|val5|val6|
|val7|val8|val9|
+----+----+----+
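As a side note, the same select can also be written with bracket indexing on the column, which should give an identical result (assuming the col_names list defined above):

df.select([df["ColA"][i].alias(c) for i, c in enumerate(col_names)]).show()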

Upvotes: 0

cph_sto

Reputation: 7585

This code is robust enough to handle any number of elements in the arrays, although the OP has 3 elements in each array. We start by creating the said DataFrame.

# Loading requisite packages.
from pyspark.sql.functions import col, explode, first, udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
df = sqlContext.createDataFrame([(['val1', 'val2', 'val3'],),
                                 (['val4', 'val5', 'val6'],),
                                 (['val7', 'val8', 'val9'],)],['ColA',])
df.show()
+------------------+
|              ColA|
+------------------+
|[val1, val2, val3]|
|[val4, val5, val6]|
|[val7, val8, val9]|
+------------------+

Since we want each element of the individual arrays to end up in its respective column, as a first step we build a mapping between column name and value. We create a user-defined function (UDF) to achieve this.

def func(c):
    return [['Col'+str(i+1),c[i]] for i in range(len(c))]
func_udf = udf(func, ArrayType(StructType([
    StructField('a', StringType()),
    StructField('b', StringType())
])))
df = df.withColumn('ColA_new',func_udf(col('ColA')))
df.show(truncate=False)
+------------------+---------------------------------------+
|ColA              |ColA_new                               |
+------------------+---------------------------------------+
|[val1, val2, val3]|[[Col1,val1], [Col2,val2], [Col3,val3]]|
|[val4, val5, val6]|[[Col1,val4], [Col2,val5], [Col3,val6]]|
|[val7, val8, val9]|[[Col1,val7], [Col2,val8], [Col3,val9]]|
+------------------+---------------------------------------+

Once this has been done, we explode the DataFrame.

# Step 1: Explode the DataFrame
df=df.withColumn('vals', explode('ColA_new')).drop('ColA_new')
df.show()
+------------------+-----------+
|              ColA|       vals|
+------------------+-----------+
|[val1, val2, val3]|[Col1,val1]|
|[val1, val2, val3]|[Col2,val2]|
|[val1, val2, val3]|[Col3,val3]|
|[val4, val5, val6]|[Col1,val4]|
|[val4, val5, val6]|[Col2,val5]|
|[val4, val5, val6]|[Col3,val6]|
|[val7, val8, val9]|[Col1,val7]|
|[val7, val8, val9]|[Col2,val8]|
|[val7, val8, val9]|[Col3,val9]|
+------------------+-----------+

Once exploded, we extract the first and second elements, which were named a and b respectively in the UDF.

df=df.withColumn('column_name', col('vals').getItem('a'))
df=df.withColumn('value', col('vals').getItem('b')).drop('vals')
df.show()
+------------------+-----------+-----+
|              ColA|column_name|value|
+------------------+-----------+-----+
|[val1, val2, val3]|       Col1| val1|
|[val1, val2, val3]|       Col2| val2|
|[val1, val2, val3]|       Col3| val3|
|[val4, val5, val6]|       Col1| val4|
|[val4, val5, val6]|       Col2| val5|
|[val4, val5, val6]|       Col3| val6|
|[val7, val8, val9]|       Col1| val7|
|[val7, val8, val9]|       Col2| val8|
|[val7, val8, val9]|       Col3| val9|
+------------------+-----------+-----+

As a last step, we pivot the DataFrame back to obtain the final result. Since pivoting involves an aggregation, we aggregate with first(), which takes the first element of each group; here every ColA/column_name pair holds exactly one value, so nothing is lost.

# Step 2: Pivot it back.
df = df.groupby('ColA').pivot('column_name').agg(first('value')).drop('ColA')
df.show()
+----+----+----+
|Col1|Col2|Col3|
+----+----+----+
|val1|val2|val3|
|val4|val5|val6|
|val7|val8|val9|
+----+----+----+

Upvotes: 1
