Giampaolo Levorato

Reputation: 1622

Create sample weight in PySpark sampled dataframe

I have created a dataframe in PySpark as follows:

df = spark.range(10)

The dataframe looks like this:

df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

I then took a random sample as follows:

df1 = df.sample(fraction=0.5, seed=123)

The sampled dataframe looks like this:

df1.show()

+---+
| id|
+---+
|  0|
|  2|
|  3|
|  5|
|  6|
|  7|
+---+

I need to add a field called "weight" to the sampled dataframe (df1). I know how to do this in pandas, but not in PySpark. Can anyone help, please?

Upvotes: 0

Views: 51

Answers (1)

Giampaolo Levorato

Reputation: 1622

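Sorted! The weight is the inverse of the sampling fraction, added as a constant column with lit:

from pyspark.sql.functions import lit
frac = 0.5
# Each sampled row stands in for 1/frac rows of the original dataframe
df1 = df.sample(fraction=frac, seed=123).withColumn("sampleWeight", lit(1/frac))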
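
One caveat: sample() does not guarantee that exactly the requested fraction of rows is returned (the sample in the question has 6 of 10 rows for a fraction of 0.5), so 1/frac is only the expected weight. If the weights should add up to the original row count, a possible alternative is to derive the weight from the realized counts. The following is a minimal sketch, not part of the original answer; the helper names realized_weight and add_realized_weight and the default column name "weight" are made up for illustration.

```python
def realized_weight(n_total: int, n_sampled: int) -> float:
    """Number of original rows that each sampled row represents."""
    if n_sampled <= 0:
        raise ValueError("the sample is empty, so no weight can be computed")
    return n_total / n_sampled


def add_realized_weight(df, df_sampled, col_name="weight"):
    """Return df_sampled with a constant column holding the realized weight.

    Hypothetical helper: df and df_sampled are assumed to be PySpark
    DataFrames, with df_sampled drawn from df.
    """
    # Imported locally so realized_weight can be used without Spark installed.
    from pyspark.sql.functions import lit

    # Each count() triggers a Spark job.
    weight = realized_weight(df.count(), df_sampled.count())
    return df_sampled.withColumn(col_name, lit(weight))
```

Usage would look like df1 = add_realized_weight(df, df.sample(fraction=0.5, seed=123)). For the sample shown in the question this gives 10/6 (about 1.67) per row instead of 2.0. The extra count() calls cost two passes over the data, so the simpler 1/frac version may be preferable for large dataframes where an approximate weight is acceptable.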

Upvotes: 1
