bootica

Reputation: 771

Pyspark / Dataframe: Add new column that keeps nested list as nested list

I have a basic question about dataframes and adding a column that should contain a nested list. This is basically the problem:

from pyspark.sql import Row

b = [[['url.de'],['name']],[['url2.de'],['name2']]]

a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0], name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]

#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]], 
 [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]

To create a new DataFrame from this output, I'm trying to create a new schema:

from pyspark.sql.functions import array, lit

schema = df.withColumn('NewColumn', array(lit("10"))).schema

I then use it to create the new DataFrame:

df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()

#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
 Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]

The problem now is that the nested list was transformed into a list of two unicode strings instead of keeping the original format.

I think this is due to my definition of the new column, "... array(lit("10"))".

What do I have to use in order to keep the original format?

Upvotes: 0

Views: 1308

Answers (1)

DavidWayne

Reputation: 2590

You can inspect the schema of the DataFrame directly by calling df.schema. In the given scenario it looks like this:

StructType(
  List(
    StructField(URL,ArrayType(StringType,true),true),
    StructField(name,ArrayType(StringType,true),true),
    StructField(NewColumn,ArrayType(StringType,false),false)
  )
)

The NewColumn that you added is an ArrayType column whose entries are all StringType. Anything contained in the array is therefore converted to a string, even if it is itself an array. If you want nested arrays (two levels), you need to change your schema so that the NewColumn field has the type ArrayType(ArrayType(StringType,False),False). You can do this by defining the schema explicitly:

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(),True), True),
    StructField("name", ArrayType(StringType(),True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])

Or by changing your code so that NewColumn is defined with a nested array call, array(array()):

df.withColumn('NewColumn',array(array(lit("10")))).schema

Upvotes: 1
