Giorgos Myrianthous

Reputation: 39910

Remove rows with missing values denoted by '?'

I have a .csv file that contains rows with missing values. Instead of null, those values are denoted by the character ?.

How can I remove the rows that contain ? in at least one column, given that df.na.drop() won't work (since the missing values are not null)?

The data looks like this (I've got 35 columns, and missing values can appear in any of them):

+-------+--------+------+-------+
| col_1 | col_2  |  ... | col_35|
+-------+--------+------+-------+
| 0.75  |   ?    |  ... |   15  |
|   ?   | Helen  |  ... |   21  |
| -1.2  | George |  ... |    ?  |
|   ?   | Andrew |  ... |   129 |
| 0.12  | Maria  |  ... |   12  |   // Should not be deleted
+-------+--------+------+-------+

And here's the code that reads the file.

val df = sparkSession.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("data.csv")
    .toDF()

Upvotes: 1

Views: 1382

Answers (2)

zero323

Reputation: 330353

If ? denotes missing values, just configure the reader to treat it as null:

val df = spark.read
  .format("csv")
  .option("nullValue", "?")  // Treat "?" as a null value
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("data.csv")
  .toDF()

and then use the standard na.drop:

df.na.drop
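
As a rough sketch of how this behaves end to end, the snippet below parses a few in-memory lines instead of data.csv. The sample rows, the local SparkSession settings, and the use of the csv(Dataset[String]) overload (available from Spark 2.2) are assumptions made for illustration, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-value-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample lines standing in for data.csv.
val lines = Seq(
  "col_1,col_2,col_35",
  "0.75,?,15",
  "?,Helen,21",
  "0.12,Maria,12"
).toDS()

val df = spark.read
  .option("header", "true")
  .option("nullValue", "?")  // "?" is parsed as null
  .csv(lines)

// Drops every row that has a null in any column;
// only the (0.12, Maria, 12) row remains.
df.na.drop().show()

// Only considers col_1 when deciding which rows to drop.
df.na.drop(Seq("col_1")).show()
```

The second call is useful when only some of the columns have to be complete.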

Upvotes: 5

ar7

Reputation: 510

You can convert the ? values to null using a UDF on a Spark data frame.

Sample code below:

import org.apache.spark.sql.functions.{col, udf}

// In spark-shell, sc and the implicits needed for toDF are already in scope
val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "?"), ("G", "?", "I"))).toDF("x", "y", "z")
// Function returns the input itself, or null if it is "?"
def replace: (String => String) = { value => if (value == "?") null else value }
// Wrap the function in a UDF so it can be applied to whole columns
val replaceudf = udf(replace)
// Apply the UDF to every column; select returns a new data frame, so keep the result
val cleaned = df.select(df.columns.map(c => replaceudf(col(c)).alias(c)): _*)

cleaned.show
/* Output
+---+----+----+
|  x|   y|   z|
+---+----+----+
|  a|   B|   c|
|  D|   e|null|
|  G|null|   I|
+---+----+----+
*/

Now you can apply any of the NA operations (for example, na.drop) to the resulting data frame. I hope this helps.
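
The same replacement can probably be done without a UDF by using the built-in when function, which returns null for rows that do not match when no otherwise branch is given. This is a sketch that assumes every column is a string (as is the case when a CSV is read without schema inference); the local SparkSession and the name nullified are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("when-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", "B", "c"), ("D", "e", "?"), ("G", "?", "I")).toDF("x", "y", "z")

// Keep the value when it is not "?"; with no otherwise(...) the result is null.
val nullified = df.select(
  df.columns.map(c => when(col(c) =!= "?", col(c)).alias(c)): _*
)

// Rows that contained "?" now contain null and are dropped;
// only the (a, B, c) row remains.
nullified.na.drop().show()
```

Built-in column functions are generally easier for Spark to optimize than UDFs, which is the main reason to prefer this form when it fits.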

Upvotes: 2
