ab AL

Reputation: 31

How to check if a value in a row is empty in Spark

I have a dataframe df which I read from a json file:

val df = spark.read.json("C:\\filepath\\file.json")

It contains the following data:

Id     downloadUrl  title
52193  https://...  Title...
5441   https://...  Title...
5280   null         null
5190   https://...  Title...
5215   https://...  Title...
1245   https://...  Title...
339    null         Editorial
59     https://...  Title...

Now I want to create a new DataFrame or RDD that contains only the rows where downloadUrl and title are both not null.

  df.map(row=>{
    // here I want to see if the downloadUrl is null
    // do something

    // else if the title is null
    // do something

    // else
    // create a new dataframe df1 with a new column "allowed" with the value set to 1 
    // push df1 to API

  })

Upvotes: 0

Views: 2097

Answers (1)

okmich

Reputation: 740

  df.map(row=>{
    // here I want to see if the downloadUrl is null
    // do something

    // else if the title is null
    // do something

    // else
    // create a new dataframe df1 with a new column "allowed" with the value set to 1 
    // push df1 to API
  })

I'm not sure what you mean by "if title/downloadUrl is null, do something".

But if you want a new DataFrame that contains only the rows where downloadUrl and title are not null, try the typed Dataset approach:

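import spark.implicits._  // provides the encoder that .as[MyObject] needs

// id is Long because Spark infers JSON integers as bigint, which cannot be up-cast to Int
case class MyObject(id: Long, downloadUrl: String, title: String)
val ds = spark.read.json("C:\\filepath\\file.json").as[MyObject]
val df1 = ds.filter(o => o.downloadUrl != null && o.title != null)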

Another way is to use the filter function on the untyped DataFrame, as below:

val df1 = df.filter(row => {
    val downloadUrl = row.getAs[String]("downloadUrl")
    val title = row.getAs[String]("title")
    // Keep the row only when both fields are non-null.
    // The last expression is the lambda's result, so no `return` is needed
    // (`return` inside a closure does not do what you would expect in Scala).
    title != null && downloadUrl != null
  })
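
The same null check can probably be written with column expressions instead of a row lambda. Below is a minimal sketch, assuming a local SparkSession and a few made-up sample rows (the example.invalid URLs and titles are placeholders, not the asker's data). It also adds the "allowed" column from the question with lit(1), and shows na.drop as an alternative for the filtering step. Note that na.drop also drops NaN values, which does not matter for string columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("not-null-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for file.json; the nulls mirror the question's data.
val df = Seq[(Long, String, String)](
  (52193L, "https://example.invalid/a", "Title A"),
  (5280L, null, null),
  (339L, null, "Editorial"),
  (59L, "https://example.invalid/b", "Title B")
).toDF("Id", "downloadUrl", "title")

// Keep rows where both columns are non-null, then add a constant "allowed" column.
val df1 = df
  .filter(col("downloadUrl").isNotNull && col("title").isNotNull)
  .withColumn("allowed", lit(1))

// Alternative for the filter step: drop rows with a null in either listed column.
val df2 = df.na.drop(Seq("downloadUrl", "title"))

df1.show()
```

Column expressions let Spark optimize the filter, whereas a row lambda is opaque to the optimizer, so this form is usually worth trying first.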

Lastly, if you want to push each row to an external API, use foreach instead, and use the predicate to decide whether the row should be pushed:

  df.foreach(row=>{
    val downloadUrl = row.getAs[String]("downloadUrl")
    val title = row.getAs[String]("title")
    // here I want to see if the downloadUrl is null
    // do something

    // else if the title is null
    // do something

    // else
    // create a new dataframe df1 with a new column "allowed" with the value set to 1 
    if (title != null && downloadUrl != null){
        //call the API here
    }
  })

In this case, however, no new DataFrame df1 is created.
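
If you need both the filtered DataFrame and the push, one option is to filter first and then call foreach on the result. This is only a sketch: ApiClient and its push method are hypothetical stand-ins for whatever client you actually use, and the sample rows are made up.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical stub for the real API client; replace the body with an actual call.
object ApiClient {
  def push(id: Long, downloadUrl: String, title: String): Unit =
    println(s"pushing $id, $downloadUrl, $title")
}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("push-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for file.json.
val df = Seq[(Long, String, String)](
  (5441L, "https://example.invalid/a", "Title A"),
  (5280L, null, null),
  (339L, null, "Editorial")
).toDF("Id", "downloadUrl", "title")

// Step 1: keep a reusable DataFrame with only the complete rows.
val clean = df.filter(col("downloadUrl").isNotNull && col("title").isNotNull)

// Step 2: push each remaining row; this runs on the executors, not the driver.
clean.foreach { row: Row =>
  ApiClient.push(
    row.getAs[Long]("Id"),
    row.getAs[String]("downloadUrl"),
    row.getAs[String]("title"))
}
```

Because foreach runs on the executors, the println output of the stub appears in the executor logs on a real cluster rather than on the driver console.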

Upvotes: 1
