neha

Reputation: 49

How to Split rows to different columns in Spark DataFrame/DataSet?

Suppose I have data set like :

Name | Subject | Y1  | Y2 
A    | math    | 1998| 2000
B    |         | 1996| 1999
     | science | 2004| 2005

I want to split each row of this data set into multiple rows, one per year from Y1 to Y2 inclusive, so that the Y2 column is eliminated, like:

Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005

Can someone suggest something here? I hope I have made my query clear. Thanks in advance.

Upvotes: 2

Views: 1064

Answers (4)

Carlos Vilchez

Reputation: 2804

I think you only need to create a UDF to build the range of years. Then you can use explode to generate the necessary rows:

import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

// UDF that expands a (Y1, Y2) pair into the inclusive list of years
val createRange = udf { (yearFrom: Int, yearTo: Int) =>
  (yearFrom to yearTo).toList
}

df.select($"Name", $"Subject", explode(createRange($"Y1", $"Y2")).as("Y1")).show()

EDIT: The Python version of this code would be something like:

from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import ArrayType, IntegerType

# UDF that returns the inclusive list of years between Y1 and Y2
createRange = udf(lambda yearFrom, yearTo: list(range(yearFrom, yearTo + 1)),
                  ArrayType(IntegerType()))

df.select(col("Name"), col("Subject"),
          explode(createRange(col("Y1"), col("Y2"))).alias("Y1")).show()

Upvotes: 4

Shivansh

Reputation: 3544

Here is one way you can implement this:

import spark.implicits._  // needed for .toDF on the resulting RDD

val resultantDF = df.rdd.flatMap { row =>
  val rangeInitial = row.getInt(2)  // Y1
  val rangeEnd     = row.getInt(3)  // Y2
  val array        = rangeInitial to rangeEnd
  // Repeat Name and Subject once per year, then zip the three sequences into rows
  (List.fill(array.size)(row.getString(0)),
   List.fill(array.size)(row.getString(1)),
   array).zipped.toList
}.toDF("Name", "Subject", "Y1")

resultantDF.show()
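If you prefer to stay in the typed Dataset API rather than dropping to RDDs, the same idea can be written as a typed flatMap. This is just a sketch, assuming a hypothetical case class Record that matches the question's schema:

import spark.implicits._

// Hypothetical case class matching the question's columns
case class Record(Name: String, Subject: String, Y1: Int, Y2: Int)

val resultantDS = df.as[Record]
  .flatMap(r => (r.Y1 to r.Y2).map(year => (r.Name, r.Subject, year)))
  .toDF("Name", "Subject", "Y1")

resultantDS.show()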

Upvotes: 0

Manu Valdés

Reputation: 2372

I have tested this code in pyspark and it works as expected:

data = sc.parallelize([["A", "math", 1998, 2000],
                       ["B", "", 1996, 1999],
                       ["", "science", 2004, 2005]])

# Key by (Name, Subject); value is the inclusive range of years
(data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
     .flatMapValues(lambda reg: reg)
     .collect())

In more detail: you need to convert the input into a pair RDD of the form (key, value), where the key is composed of the first two fields (Name and Subject), since the result will be flattened with flatMapValues while keeping the key intact. The value is built as the inclusive range from Y1 to Y2. All of this happens in the first map.

flatMapValues will then return each of the values in the range, paired with its key.

The output looks like this:

[(('A', 'math'), 1998),
 (('A', 'math'), 1999),
 (('A', 'math'), 2000),
 (('B', ''), 1996),
 (('B', ''), 1997),
 (('B', ''), 1998),
 (('B', ''), 1999),
 (('', 'science'), 2004),
 (('', 'science'), 2005)]

Upvotes: 1

Hasson

Reputation: 1914

You can easily use a Spark SQL select to get what you want in a DataFrame, or even in an RDD.

Dataset<Row> sqlDF = spark.sql("SELECT Name,Subject,Y1 FROM tableName");

If you are starting from an already existing DataFrame, say usersDF, you can use something like this:

Dataset<Row> resultDF = usersDF.select("Name", "Subject", "Y1");
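Note that the SQL form above only works if the DataFrame has been registered as a temporary view under the queried name. A minimal sketch (shown here in Scala), assuming the view name tableName from the query:

// Register the DataFrame so it can be referenced by name in SQL
usersDF.createOrReplaceTempView("tableName")
val sqlDF = spark.sql("SELECT Name, Subject, Y1 FROM tableName")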

Upvotes: -1
