Reputation: 146
I have a PySpark DataFrame and I want to update or create a column based on a pattern. My goal is to replace Col2 of the initial dataframe (or create a new column) so that the values of its first two rows (2.54, 3.21) are repeated as a pattern down the rest of Col2, as in the target dataframe below.
Initial dataframe:
| ident1 | Col2 | Tag |
|:-------|:-----:|:----:|
| id1 | 2.54 | A |
| id2 | 3.21 | A |
| id3 | 2.23 | D |
| id4 | 3.41 | A |
| id5 | 2.45 | C |
| id6 | 3.11 | D |
Target Dataframe:
| ident1 | Col2 | Tag |
|:-------|:-----:|:----:|
| id1 | 2.54 | A |
| id2 | 3.21 | A |
| id3 | 2.54 | D |
| id4 | 3.21 | A |
| id5 | 2.54 | C |
| id6 | 3.21 | D |
How can I do it? For a small dataset I could do it with a for loop or list functions, but I am dealing with a big dataset.
Upvotes: 0
Views: 50
Reputation: 3348
This will work -
Input:
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.createDataFrame([["1", 2.54, "A"], ["2", 3.21, "B"], ["3", 3.54, "C"], ["4", 4.21, "D"], ["5", 5.54, "C"]]).toDF("id", "col2", "tag")
df.show()
Solution:
# Number the rows, then grab the first two values as the repeating pattern
df = df.withColumn("rn", F.row_number().over(Window.orderBy("id")))
firstTwoRows = df.filter(F.col("rn") <= 2).orderBy("rn").collect()
# Odd row numbers get the first value, even row numbers the second
df.withColumn("col2", F.when(F.col("rn") % 2 == 1, F.lit(firstTwoRows[0]["col2"])).otherwise(F.lit(firstTwoRows[1]["col2"]))).drop("rn").show()
The assumption here is that by "first two rows" you mean the two smallest ids; if that is not the case, simply replace the orderBy("id") with a dummy orderBy(F.lit("A")).
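The core of the trick is just the row number modulo the pattern length. As a sketch in plain Python (the variable names here are illustrative, not from the PySpark code above), the mapping that the when/otherwise expression applies looks like this, and the same (rn - 1) % len(pattern) indexing would also generalize to patterns longer than two rows:

```python
pattern = [2.54, 3.21]                      # values of the first two rows
col2 = [2.54, 3.21, 2.23, 3.41, 2.45, 3.11]  # original Col2 from the question

# row_number() is 1-based, so (rn - 1) % len(pattern) picks the pattern slot
tiled = [pattern[(rn - 1) % len(pattern)] for rn in range(1, len(col2) + 1)]
print(tiled)  # [2.54, 3.21, 2.54, 3.21, 2.54, 3.21]
```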
Upvotes: 1