Reputation: 23
I have a PySpark DataFrame with a numerical column whose values range from 1 to 100. I want to bucket the values into labelled groups, e.g.:
1-10 → group1 (rows with a value from 1 to 10 should get group1), 11-20 → group2, . . . , 91-100 → group10
How can I achieve this with a PySpark DataFrame?
Upvotes: 0
Views: 1541
Reputation: 7597
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use the floor() function to take the integral part of a number; for example, floor(15.5) is 15. We need the integral part of (Var - 1)/10, plus 1 because the group numbering starts from 1 rather than 0. Subtracting 1 before dividing keeps the boundary values 10, 20, ..., 100 in the lower group, as the question's ranges require; floor(Var/10) + 1 would put 10 into group2 and 100 into group11. Finally, we prepend the word group to the number. Concatenation is done with the concat() function, but since group is a string literal rather than a column, it must be wrapped in lit(), which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var', concat(lit('group'), (1 + floor((col('Var') - 1) / 10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+
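The grouping arithmetic can be sanity-checked without a Spark session. A minimal pure-Python sketch of the same idea (the helper name bucket_label is mine, not part of the answer); note that subtracting 1 before dividing is what keeps boundary values such as 10 and 100 in the lower group, matching the 1-10 / 91-100 ranges in the question:

```python
from math import floor

def bucket_label(v: int) -> str:
    """Map a value in 1..100 to its group label, 10 values per group.

    Subtracting 1 before dividing keeps exact multiples of 10
    (10, 20, ..., 100) in the lower group.
    """
    return f"group{floor((v - 1) / 10) + 1}"

# Boundary checks: 1-10 -> group1, 11-20 -> group2, ..., 91-100 -> group10
print(bucket_label(1))    # group1
print(bucket_label(10))   # group1
print(bucket_label(11))   # group2
print(bucket_label(54))   # group6
print(bucket_label(100))  # group10
```

The same expression maps directly onto the column arithmetic in the PySpark answer above.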
Upvotes: 1