Issouf

Reputation: 146

PySpark dataframe: repeat a pattern in a column

I have a PySpark DataFrame, and I want to update or create a column based on a pattern. My goal is to replace col2 of the initial dataframe (or create a new column) so that the values of the first two rows (2.54, 3.21) are repeated as a pattern down the remaining rows of col2, as in the target dataframe below.

Initial dataframe:

| ident1 | Col2 | Tag |
|:-------|:----:|:---:|
| id1    | 2.54 |  A  |
| id2    | 3.21 |  A  |
| id3    | 2.23 |  D  |
| id4    | 3.41 |  A  |
| id5    | 2.45 |  C  |
| id6    | 3.11 |  D  |

Target Dataframe:

| ident1 | Col2 | Tag |
|:-------|:----:|:---:|
| id1    | 2.54 |  A  |
| id2    | 3.21 |  A  |
| id3    | 2.54 |  D  |
| id4    | 3.21 |  A  |
| id5    | 2.54 |  C  |
| id6    | 3.21 |  D  |

How can I do it? For a small dataset I could do it with a for loop or list functions, but I am dealing with a big dataset.

Upvotes: 0

Views: 50

Answers (1)

Ronak Jain

Reputation: 3348

This will work -

Input:

df = spark.createDataFrame(
    [["1", 2.54, "A"], ["2", 3.21, "B"], ["3", 3.54, "C"],
     ["4", 4.21, "D"], ["5", 5.54, "C"]]).toDF("id", "col2", "tag")
df.show()

+---+----+---+
| id|col2|tag|
+---+----+---+
|  1|2.54|  A|
|  2|3.21|  B|
|  3|3.54|  C|
|  4|4.21|  D|
|  5|5.54|  C|
+---+----+---+

Output:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# number the rows, then collect the first two to use as the pattern
df = df.withColumn("rn", F.row_number().over(Window.orderBy("id")))
firstTwoRows = df.filter(F.col("rn") <= 2).collect()

# odd row numbers (1, 3, 5, ...) get the first value, even ones the second
df.withColumn("col2", F.when(F.col("rn") % 2 == 1, F.lit(firstTwoRows[0]["col2"]))
                       .otherwise(F.lit(firstTwoRows[1]["col2"]))).drop("rn").show()

+---+----+---+
| id|col2|tag|
+---+----+---+
|  1|2.54|  A|
|  2|3.21|  B|
|  3|2.54|  C|
|  4|3.21|  D|
|  5|2.54|  C|
+---+----+---+

The assumption here is that by "first two rows" you mean the two smallest ids; if that is not the case, replace the orderBy("id") with a dummy orderBy(F.lit("A")) to number the rows in their incoming order.

Upvotes: 1
