Narendra Mohan Prasad

Reputation: 115

Case-insensitive search in an array-type column of a Spark DataFrame

I have a Spark data frame like the following:

 +----------+-------------------------------------------------+  
 |col1      |words                                            |  
 +----------+-------------------------------------------------+  
 |An        |[An, attractive, ,, thin, low, profile]          |  
 |attractive|[An, attractive, ,, thin, low, profile]          |  
 |,         |[An, attractive, ,, thin, low, profile]          |  
 |thin      |[An, attractive, ,, thin, low, profile]          |    
 |rail      |[An, attractive, ,, thin, low, profile]          |  
 |profile   |[An, attractive, ,, thin, low, profile]          |  
 |Lighter   |[Lighter, than, metal, ,, Level, ,, and, tes]    |  
 |than      |[Lighter, than, metal, ,, Level, ,, and, tww]    |  
 |steel     |[Lighter, than, metal, ,, Level, ,, and, test]   |  
 |,         |[Lighter, than, metal, ,, Level, ,, and, Test]   |  
 |Level     |[Lighter, than, metal, ,, Level, ,, and, test]   |  
 |,         |[Lighter, than, metal, ,, Level, ,, and, ste]    |  
 |and       |[Lighter, than, metal, ,, Level, ,, and, ste]    |  
 |Test      |[Lighter, than, metal, ,, Level, ,, and, Ste]    |  
 |Renewable |[Renewable, resource]                            |  
 |Resource  |[Renewable, resource]                            |  
 |No        |[No1, Bal, testme, saves, time, and, money]      |  
 +----------+-------------------------------------------------+  

I want to filter the data in the above column case-insensitively. Currently I am doing it like this:

df.filter(array_contains('words, "level")).show(false)

but it is not showing any data. Please help me resolve the issue.

Upvotes: 1

Views: 3700

Answers (2)

Ramesh Maharjan

Reputation: 41957

Datasets are much easier to work with than DataFrames for this, so I suggest you convert your dataframe to a dataset, or just create the Dataset directly from your source data.

Suppose you have a dataset with a case class defined as

case class data(col1: String, words: Array[String])

For illustration purposes, I am creating a temporary dataset as

import sqlContext.implicits._   // on a Spark 2.x SparkSession, use spark.implicits._ instead
val ds = Seq(
  data("profile", Array("An", "attractive", "", "", "thin", "low", "profile")),
  data("Lighter", Array("Lighter", "than", "metal", "", "", "Level", "", "", "and", "tes"))
).toDS

which is similar to the dataframe you have

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|profile|[An, attractive, , , thin, low, profile]       |
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

You can perform RDD-like operations on the dataset and filter for the rows that contain Level as

ds.filter(row => row.words.map(_.toLowerCase).contains("level")).show(false)

the result is

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

Updated

Since you are struggling to convert the dataframe to a dataset, here's one way to do it.

Suppose you have a dataframe (df) as

+---+-------------+--------+---+
|age|maritalStatus|name    |sex|
+---+-------------+--------+---+
|35 |M            |Joanna  |F  |
|25 |S            |Isabelle|F  |
|19 |S            |Andy    |M  |
|70 |M            |Robert  |M  |
+---+-------------+--------+---+
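For completeness, that sample dataframe can be recreated as follows (a minimal sketch, assuming spark.implicits._ is already in scope):

val df = Seq(
  (35, "M", "Joanna", "F"),
  (25, "S", "Isabelle", "F"),
  (19, "S", "Andy", "M"),
  (70, "M", "Robert", "M")
).toDF("age", "maritalStatus", "name", "sex")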

Then create a case class that matches the schema of df as

case class dataset(age: Int, maritalStatus: String, name: String, sex: String)

Then a conversion with .as should do the trick as

val dataSet : Dataset[dataset] = df.as[dataset]

Then you can proceed as explained in the first part of this answer.
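Putting the two parts together for the original words dataframe, a minimal end-to-end sketch would look like this (assuming a Spark 2.x SparkSession named spark; the sample rows are illustrative, not your full data):

import org.apache.spark.sql.Dataset

case class data(col1: String, words: Array[String])

import spark.implicits._

// recreate a small dataframe shaped like the one in the question
val df = Seq(
  ("profile", Array("An", "attractive", "thin", "low", "profile")),
  ("Lighter", Array("Lighter", "than", "metal", "Level", "and", "tes"))
).toDF("col1", "words")

// convert the untyped dataframe to a typed Dataset, then filter case-insensitively
val ds: Dataset[data] = df.as[data]
ds.filter(row => row.words.map(_.toLowerCase).contains("level")).show(false)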

I hope the answer is helpful.

Upvotes: 1

koiralo

Reputation: 23109

For this you can create a simple udf that converts both the array elements and the search term to lower case before comparing, and use it in the filter.

Here is a simple example:

scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]

scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))

scala> df.filter(filterUdf($"words")).show(false)

+-------+-------------------------------------------------+
|col1   |words                                            |
+-------+-------------------------------------------------+
|Lighter|[Lighter, than, metal,  , , Level,  , , and, tes]|
+-------+-------------------------------------------------+
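If you want the search term to be a parameter rather than hardcoded inside the udf, one variation (a sketch, not part of the original answer; containsIgnoreCase is just an illustrative name) is to pass the term in as a literal column, using lit from the functions already imported above:

// hypothetical variant: the search term becomes a udf argument
val containsIgnoreCase = udf((arr: Seq[String], word: String) => arr.exists(_.equalsIgnoreCase(word)))

df.filter(containsIgnoreCase($"words", lit("level"))).show(false)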

Hope this helps!

Upvotes: 2
