Reputation: 592
I have a PySpark dataframe with two columns, A and B. Both columns are of string datatype. Following is an example of the dataframe:
|-------|-------|
| A | B |
|-------|-------|
| "a1" | "b1" |
| "a2" | "b2" |
| "a3" | "b3" |
| "a4" | "b4" |
| "a5" | "b5" |
| "a6" | "b6" |
|-------|-------|
I want to affix one of the suffixes _1, _2, or _3 to randomly chosen rows, creating two new columns A1 and B1, so that the dataframe looks like the following:
|-------|-------|---------|---------|
| A | B | A1 | B1 |
|-------|-------|---------|---------|
| "a1" | "b1" | "a1_1" | "b1_1" |
| "a2" | "b2" | "a2_2" | "b2_2" |
| "a3" | "b3" | "a3_1" | "b3_1" |
| "a4" | "b4" | "a4_3" | "b4_3" |
| "a5" | "b5" | "a5_3" | "b5_3" |
| "a6" | "b6" | "a6_2" | "b6_2" |
|-------|-------|---------|---------|
Here are a few rules that I need to take care of:

- If a suffix (_1, _2, or _3) is attached to the end of a row in A, the same suffix is attached to the end of the same row in B also.
- Each row gets exactly one suffix (_1, _2, or _3), and each suffix appears exactly two times.

I would like a solution that works for the aforementioned example. How can I do this using PySpark? In my actual dataframe, I expect close to a million rows and about 5 suffixes, so an efficient solution would be really great here.
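For reference, a minimal sketch to reproduce the example dataframe (the SparkSession setup is an assumption; adjust to your environment):

from pyspark.sql import SparkSession

# Hypothetical session; replace with your own configuration.
spark = SparkSession.builder.getOrCreate()

# Six-row example dataframe with string columns A and B.
df = spark.createDataFrame(
    [("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
     ("a4", "b4"), ("a5", "b5"), ("a6", "b6")],
    ["A", "B"],
)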
Upvotes: 3
Views: 1060
Reputation: 42422
import pyspark.sql.functions as F
# F.rand() is uniform in [0, 1), so rand * 3 + 1 lies in [1, 4);
# casting to int yields a random value in {1, 2, 3}.
df = df.withColumn("rand", (F.rand() * 3 + 1).cast("int"))

# Concatenate each column with an underscore and the shared random suffix,
# so a row gets the same suffix in A1 and B1.
a1 = F.concat(F.col("A"), F.lit("_"), F.col("rand")).alias("A1")
b1 = F.concat(F.col("B"), F.lit("_"), F.col("rand")).alias("B1")

df = df.select("A", "B", a1, b1)
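Note that F.rand() assigns each row's suffix independently, so it satisfies the "same suffix in A and B" rule but does not guarantee that every suffix appears exactly the same number of times. If the equal-count rule is strict, here is a minimal sketch of one way to enforce it, assuming the row count is an exact multiple of the number of suffixes: shuffle the rows with a random ordering, then hand out suffixes round-robin.

from pyspark.sql import Window
import pyspark.sql.functions as F

NUM_SUFFIXES = 3  # assumption: change to ~5 for the real data

# Order rows randomly, then assign suffixes 1..NUM_SUFFIXES round-robin
# so each suffix appears an equal number of times.
# Caveat: a window with no partitioning pulls all rows into a single
# partition, which may be slow at the million-row scale.
w = Window.orderBy(F.rand())
df = df.withColumn(
    "rand", ((F.row_number().over(w) - 1) % NUM_SUFFIXES) + 1
)

The A1/B1 construction above then applies unchanged on top of this rand column.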
Upvotes: 4