Reputation: 771
I have a basic question about dataframes and adding a column that should contain a nested list. This is basically the problem:
from pyspark.sql import Row
b = [[['url.de'],['name']],[['url2.de'],['name2']]]
a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0],name=p[1]))
df = sqlContext.createDataFrame(a)
list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]
#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]],
[['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]
To create a new DataFrame from this output, I'm trying to create a new schema:
from pyspark.sql.functions import array, lit
schema = df.withColumn('NewColumn',array(lit("10"))).schema
I then use it to create the new DataFrame:
df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()
#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]
The problem is that the nested list was transformed into a list with two unicode entries instead of keeping its original format.
I think this is due to my definition of the new column, "... array(lit("10"))".
What do I have to use in order to keep the original format?
Upvotes: 0
Views: 1308
Reputation: 2590
You can inspect the schema of the DataFrame directly by calling df.schema. In the given scenario it looks like this:
StructType(
List(
StructField(URL,ArrayType(StringType,true),true),
StructField(name,ArrayType(StringType,true),true),
StructField(NewColumn,ArrayType(StringType,false),false)
)
)
The NewColumn that you added is an ArrayType column whose entries are all StringType. So anything contained in the array is converted to a string, even if it is itself an array. If you want nested arrays (two layers), you need to change your schema so that the NewColumn field has the type ArrayType(ArrayType(StringType,False),False). You can do this by explicitly defining the schema:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType
schema = StructType([
StructField("URL", ArrayType(StringType(),True), True),
StructField("name", ArrayType(StringType(),True), True),
StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])
Or change your code so that NewColumn is defined by nesting the array function, array(array()):
df.withColumn('NewColumn',array(array(lit("10")))).schema
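To catch this kind of mismatch before calling createDataFrame, you could compare how deeply the data is nested with how deeply the column type is nested. The following is only a plain-Python sketch that needs no Spark session. The helper names array_depth_of_type and list_depth are made up for illustration and are not part of PySpark, and the type strings are written by hand in the array<...> notation that df.dtypes prints.

```python
def array_depth_of_type(type_string):
    """Count how many array<...> layers wrap the element type."""
    depth = 0
    while type_string.startswith("array<") and type_string.endswith(">"):
        depth += 1
        # Strip one outer "array<" ... ">" layer.
        type_string = type_string[len("array<"):-1]
    return depth


def list_depth(value):
    """Count how many list layers wrap the first leaf value."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:
            break  # an empty list has no element to descend into
        value = value[0]
    return depth


# Hand-written type strings (assumed to match what df.dtypes would report).
flat_type = "array<string>"           # ArrayType(StringType)
nested_type = "array<array<string>>"  # ArrayType(ArrayType(StringType))

# The value intended for NewColumn in the question.
new_column_value = [['a', 's', 'o'], ['hallo', 'ti']]

data_depth = list_depth(new_column_value)
print(data_depth == array_depth_of_type(flat_type))    # False: one layer too few
print(data_depth == array_depth_of_type(nested_type))  # True: both have two layers
```

With the original schema the data has one more list layer than the column type, which is where the inner lists end up as strings.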
Upvotes: 1