Reputation: 1622
I have created a dataframe in PySpark as follows:
df = spark.range(10)
The dataframe looks like this:
df.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
I have then taken a random sample as follows:
df1 = df.sample(fraction=0.5, seed=123)
The sampled dataframe looks like this:
df1.show()
+---+
| id|
+---+
| 0|
| 2|
| 3|
| 5|
| 6|
| 7|
+---+
I need to create a field called "weight" in the sampled dataframe (df1). I know how to do it in Pandas, but I do not know how to do it in PySpark. Can anyone help me, please?
Upvotes: 0
Views: 51
Reputation: 1622
Sorted! The weight is just the inverse of the sampling fraction, added with withColumn (note that lit needs to be imported):
from pyspark.sql.functions import lit

frac = 0.5
df1 = df.sample(fraction=frac, seed=123).withColumn("sampleWeight", lit(1/frac))
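For completeness, here is a minimal self-contained sketch of the same idea, assuming the weight is meant as the inverse sampling probability (the alias "estimatedRows" is just illustrative):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

frac = 0.5
df = spark.range(10)

# Sample roughly half the rows and attach the inverse-probability weight.
df1 = df.sample(fraction=frac, seed=123).withColumn("sampleWeight", F.lit(1 / frac))

# Summing the weights gives an estimate of the original row count.
df1.agg(F.sum("sampleWeight").alias("estimatedRows")).show()
Because sample() does per-row Bernoulli sampling, the number of rows you get back varies around fraction * count rather than being exact, which is why a weight column like this is useful for re-scaling aggregates.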
Upvotes: 1