Carolina
Carolina

Reputation: 135

How can I create a tuple column out of another column?

I have a column of scores and I want to create a column of bands for this score.

Example: Score 623 should have a column with value (600, 625].

In python I used to do this like this:

df['score_band_25'] = df['score'].apply(lambda x: '('+str((int((x-0.1)//25)*25))+', '+str(int((x-0.1)//25)*25+25)+']')

How can I do this in Scala? The best I could come up with so far was:

.withColumn("score_band_25", floor(($"score"-0.1)/25)*25)

but that only gives me the first part of the tuple, and I can't find a way to unite floor(($"score"-0.1)/25)*25) and floor(($"score"-0.1)/25)*25+25) as strings.

Ideally I would like to keep having a ( at the beginning and a ] at the end, but I would be happy with two (600, 625) too.

Upvotes: 0

Views: 168

Answers (1)

Leo C
Leo C

Reputation: 22449

If I understand your requirement correctly, you're trying to create a String-type column of half-open intervals corresponding to the values of an Integer column – in which case you can simply use concat() to concatenate the calculated interval limits:

import org.apache.spark.sql.functions._

val df = Seq(
  12, 149, 300, 623
).toDF("score")

val interval = 25

df.withColumn("score_band_" + interval, concat(
  lit("("),
  floor(($"score" - 0.1) / interval) * interval,
  lit(", "),
  floor(($"score" - 0.1) / interval) * interval + interval,
  lit("]")
)).show
// +-----+-------------+
// |score|score_band_25|
// +-----+-------------+
// |   12|      (0, 25]|
// |  149|   (125, 150]|
// |  300|   (275, 300]|
// |  623|   (600, 625]|
// +-----+-------------+

Note that concat takes Column-type parameters, thus lit() is needed for the literal strings (e.g. "(").

Upvotes: 2

Related Questions