Reputation: 681
I am using a PySpark DataFrame. My dataset contains three attributes: id, name, and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get the unique id of the row I want to delete:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
The output I am getting is the following: [Row(id='382')]
I am wondering how I can use this id to delete a row. Also, how can I replace a certain value in a DataFrame with another? For example, replacing all values == "Bruce" with "John".
Upvotes: 2
Views: 12325
Reputation: 43524
From the docs for pyspark.sql.DataFrame.collect(), the function:
Returns all the records as a list of Row.
The fields in a pyspark.sql.Row can be accessed like dictionary values.
So for your example:
ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()
#[Row(id='382')]
You can access the id field by doing:
id_vals = [r['id'] for r in ID]
#['382']
But looking up values one at a time is generally a poor fit for Spark DataFrames, which are designed for set-based operations. You should think about your end goal, and see if there's a better way to achieve it.
EDIT
Based on your comments, it seems you want to replace the values in the name column with another value. One way to do this is by using pyspark.sql.functions.when().

This function takes a boolean column expression as the first argument; here that is f.col("name") == "Bruce". The second argument is what should be returned if the boolean expression is True; here that is f.lit(replacement_value).
For example:
import pyspark.sql.functions as f
replacement_value = "Wayne"
df = df.withColumn(
    "name",
    f.when(f.col("name") == "Bruce", f.lit(replacement_value)).otherwise(f.col("name"))
)
Upvotes: 4