How to add a completely irrelevant column to a data frame when using pyspark, spark + databricks

Question

Let's say I have a data frame:

myGraph=spark.createDataFrame([(1.3,2.1,3.0),
                               (2.5,4.6,3.1),
                               (6.5,7.2,10.0)],
                              ['col1','col2','col3'])

I want to add a new string column, which I can do like this:

from pyspark.sql.functions import lit
myGraph=myGraph.withColumn('rowName',lit('xxx'))

At this point every value in rowName is 'xxx'. What I do not know is how to put different values ('col1','col2','col3') into rowName instead.

abiratsis · Accepted Answer

You can create a random integer value (0 to N-1) using the built-in rand() function and a UDF helper function that builds the new string, as follows:

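import org.apache.spark.sql.functions.{rand, udf}

// Build the label by prefixing the number with "X"
val randColumnUDF = udf((rand: Long) => s"X${rand}")
val N = 10000

df.withColumn("rand", randColumnUDF(rand() * N)).show(false)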

+----+
|rand|
+----+
|X1  |
|X8  |
|X6  |
|... |
+----+

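The code above appends a random number to X, producing values such as X1, X23, and so on. Because rand() returns a value in [0, 1) and the product is truncated to a Long, the number falls between 0 and 9999.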
