Reputation: 592
I have a PySpark dataframe with two columns, A and B. Both columns are of string datatype. Following is an example of the dataframe:
|-------|-------|
| A | B |
|-------|-------|
| "a1" | "b1" |
| "a2" | "b2" |
| "a3" | "b3" |
| "a4" | "b4" |
| "a5" | "b5" |
| "a6" | "b6" |
|-------|-------|
I want to affix one of the suffixes _1, _2, or _3 to randomly chosen rows, creating two new columns A1 and B1, so that the dataframe looks like the following:
|-------|-------|---------|---------|
| A | B | A1 | B1 |
|-------|-------|---------|---------|
| "a1" | "b1" | "a1_1" | "b1_1" |
| "a2" | "b2" | "a2_2" | "b2_2" |
| "a3" | "b3" | "a3_1" | "b3_1" |
| "a4" | "b4" | "a4_3" | "b4_3" |
| "a5" | "b5" | "a5_3" | "b5_3" |
| "a6" | "b6" | "a6_2" | "b6_2" |
|-------|-------|---------|---------|
Here are a few rules that I need to take care of:

- If a suffix (_1, _2, or _3) is attached to the end of a row in A, the same suffix is attached to the end of the same row in B also.
- Each row gets exactly one suffix (_1, _2, or _3), and each suffix appears exactly two times.

I would like a solution that works for the aforementioned example. How can I do this using PySpark? In my actual dataframe, I expect close to a million rows and about 5 suffixes, so an efficient solution would be really great here.
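For reference, a minimal sketch to reproduce the example dataframe (the SparkSession setup is an assumption; adjust to your environment):

from pyspark.sql import SparkSession

# Hypothetical session; replace with your own configuration.
spark = SparkSession.builder.getOrCreate()

# Six-row example dataframe with string columns A and B.
df = spark.createDataFrame(
    [("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
     ("a4", "b4"), ("a5", "b5"), ("a6", "b6")],
    ["A", "B"],
)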
Upvotes: 3
Views: 1060
Reputation: 42422
import pyspark.sql.functions as F
# F.rand() is uniform in [0, 1), so rand * 3 + 1 lies in [1, 4);
# casting to int yields a random value in {1, 2, 3}.
df = df.withColumn("rand", (F.rand() * 3 + 1).cast("int"))

# Concatenate each column with an underscore and the shared random suffix,
# so a row gets the same suffix in A1 and B1.
a1 = F.concat(F.col("A"), F.lit("_"), F.col("rand")).alias("A1")
b1 = F.concat(F.col("B"), F.lit("_"), F.col("rand")).alias("B1")

df = df.select("A", "B", a1, b1)
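Note that F.rand() assigns each row's suffix independently, so it satisfies the "same suffix in A and B" rule but does not guarantee that every suffix appears exactly the same number of times. If the equal-count rule is strict, here is a minimal sketch of one way to enforce it, assuming the row count is an exact multiple of the number of suffixes: shuffle the rows with a random ordering, then hand out suffixes round-robin.

from pyspark.sql import Window
import pyspark.sql.functions as F

NUM_SUFFIXES = 3  # assumption: change to ~5 for the real data

# Order rows randomly, then assign suffixes 1..NUM_SUFFIXES round-robin
# so each suffix appears an equal number of times.
# Caveat: a window with no partitioning pulls all rows into a single
# partition, which may be slow at the million-row scale.
w = Window.orderBy(F.rand())
df = df.withColumn(
    "rand", ((F.row_number().over(w) - 1) % NUM_SUFFIXES) + 1
)

The A1/B1 construction above then applies unchanged on top of this rand column.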
Upvotes: 4