I have a .csv file that contains rows with missing values. Instead of null, those missing values are denoted by the character ?. How can I remove the rows that contain at least one column with the value ?, given that df.na.drop() won't work (since the missing values are not null)?
The data looks like the following (there are 35 columns, and missing values can appear in any of them):
+-------+--------+------+-------+
| col_1 | col_2 | ... | col_35|
+-------+--------+------+-------+
| 0.75 | ? | ... | 15 |
| ? | Helen | ... | 21 |
| -1.2 | George | ... | ? |
| ? | Andrew | ... | 129 |
| 0.12 | Maria | ... | 12 | // Should not be deleted
+-------+--------+------+-------+
And here's the code that reads the file.
val df = sparkSession.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("data.csv")
.toDF()
Upvotes: 1
Views: 1382
If ? denotes missing values, you just configure the reader to recognize that:
val df = spark.read
.format("csv")
  .option("nullValue", "?") // treat "?" as null
.option("header", "true")
.option("mode", "DROPMALFORMED")
.load("data.csv")
.toDF()
and use the standard na.drop:
df.na.drop
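To see what na.drop does once "?" is read as null, here is a minimal plain-Python sketch (no Spark required; the sample rows are invented for illustration) of its default "any" semantics, which drops a row if any of its fields is missing:

```python
# Plain-Python sketch of na.drop's default "any" semantics:
# keep only the rows in which no field is missing.
rows = [
    ["0.75", None, "15"],     # dropped: second column missing
    [None, "Helen", "21"],    # dropped: first column missing
    ["0.12", "Maria", "12"],  # kept: fully populated
]

kept = [row for row in rows if None not in row]
print(kept)  # [['0.12', 'Maria', '12']]
```

na.drop also accepts a "how" argument ("any" or "all") and a subset of column names, if only some of the 35 columns should be considered.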
Upvotes: 5
You can convert the ? values to null using a UDF on the Spark DataFrame. Sample code below:
import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "?"), ("G", "?", "I"))).toDF("x", "y", "z")

// Returns the input unchanged, or null if it is a "?"
def replace: (String => String) = { value => if (value == "?") null else value }

// Wrap the function in a UDF so it can be applied to a whole column
val replaceudf = udf(replace)

Apply the UDF to every column of the DataFrame. Note that select returns a new DataFrame, so the result must be captured:

val replaced = df.select(df.columns.map(c => replaceudf(col(c)).alias(c)): _*)
replaced.show
/* Output
+---+----+----+
| x| y| z|
+---+----+----+
| a| B| c|
| D| e|null|
| G|null| I|
+---+----+----+
*/
Now you can apply all of the NA operations (such as na.drop) to the DataFrame. I hope this helps.
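The cell-level transformation the UDF performs can also be seen without Spark at all. This plain-Python sketch (sample data invented) mirrors what the UDF does to each cell of each column:

```python
# Turn every "?" cell into a missing value, as the UDF above does per column.
def replace(value):
    return None if value == "?" else value

data = [["a", "B", "c"], ["D", "e", "?"], ["G", "?", "I"]]
cleaned = [[replace(v) for v in row] for row in data]
print(cleaned[1][2])  # None
```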
Upvotes: 2