Reputation: 115
I have spark data frame like following:
+----------+-------------------------------------------------+
|col1 |words |
+----------+-------------------------------------------------+
|An |[An, attractive, ,, thin, low, profile] |
|attractive|[An, attractive, ,, thin, low, profile] |
|, |[An, attractive, ,, thin, low, profile] |
|thin |[An, attractive, ,, thin, low, profile] |
|rail |[An, attractive, ,, thin, low, profile] |
|profile |[An, attractive, ,, thin, low, profile] |
|Lighter |[Lighter, than, metal, ,, Level, ,, and, tes] |
|than |[Lighter, than, metal, ,, Level, ,, and, tww] |
|steel |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, Test] |
|Level |[Lighter, than, metal, ,, Level, ,, and, test] |
|, |[Lighter, than, metal, ,, Level, ,, and, ste] |
|and |[Lighter, than, metal, ,, Level, ,, and, ste] |
|Test |[Lighter, than, metal, ,, Level, ,, and, Ste] |
|Renewable |[Renewable, resource] |
|Resource |[Renewable, resource] |
|No |[No1, Bal, testme, saves, time, and, money] |
+----------+-------------------------------------------------+
I want to filter the rows of this DataFrame case-insensitively on the words column. Currently I am doing it like this:
df.filter(array('words, "level")).show(false)
but it does not return any rows. Please help me resolve the issue.
Upvotes: 1
Views: 3700
Reputation: 41957
Datasets are much easier to work with than DataFrames, so I suggest you convert your DataFrame to a Dataset, or create a Dataset directly from your source data. Suppose you have a Dataset built from the case class
case class data(col1: String, words: Array[String])
For illustration purposes I am creating a temporary Dataset:
import sqlContext.implicits._
val ds = Seq(
data("profile", Array("An", "attractive", "", "", "thin", "low", "profile")),
data("Lighter", Array("Lighter", "than", "metal", "", "", "Level", "", "", "and", "tes"))
).toDS
which is similar to the DataFrame you have:
+-------+-----------------------------------------------+
|col1 |words |
+-------+-----------------------------------------------+
|profile|[An, attractive, , , thin, low, profile] |
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+
You can perform RDD-like operations on the Dataset and filter in the rows whose words array contains Level, ignoring case:
ds.filter(row => row.words.map(element => element.toLowerCase).contains("level")).show(false)
The result is:
+-------+-----------------------------------------------+
|col1 |words |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+
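The key part is the predicate itself, which is plain Scala and can be checked without Spark. A minimal standalone sketch (the sample Array mirrors one row of the example data):

```scala
// Standalone sketch of the predicate used in the Dataset filter above;
// no Spark needed, since it operates on an ordinary Scala Array.
object CaseInsensitiveContains {
  def main(args: Array[String]): Unit = {
    val words = Array("Lighter", "than", "metal", ",", "Level", ",", "and", "tes")
    // Lowercase every element, then test for membership.
    val containsLevel = words.map(_.toLowerCase).contains("level")
    println(containsLevel) // prints true
  }
}
```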
Updated
As you are struggling to convert the DataFrame to a Dataset, here is one way to do it. Suppose you have a DataFrame (df) like this:
+---+-------------+--------+---+
|age|maritalStatus|name |sex|
+---+-------------+--------+---+
|35 |M |Joanna |F |
|25 |S |Isabelle|F |
|19 |S |Andy |M |
|70 |M |Robert |M |
+---+-------------+--------+---+
Then a case class should be created to match the schema of df:
case class dataset(age: Int, maritalStatus: String, name: String, sex: String)
Then a change of alias should do the trick:
val dataSet : Dataset[dataset] = df.as[dataset]
Then you can proceed as explained in the first part of this answer.
I hope the answer is helpful.
Upvotes: 1
Reputation: 23109
For this you can create a simple UDF that lowercases both the array elements and the search term before comparing. Here is a simple example:
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]
scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))
scala> df.filter(filterUdf($"words")).show(false)
+-------+-------------------------------------------------+
|col1 |words |
+-------+-------------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]  |
+-------+-------------------------------------------------+
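As a side note, the same predicate can also be written with exists and equalsIgnoreCase, which stops at the first match instead of lowercasing a copy of the whole array; wrapped in udf exactly as above, it behaves the same. A sketch of the plain-Scala predicate only:

```scala
// Equivalent case-insensitive membership test, written with exists so it
// short-circuits on the first match (no lowercased copy of the array).
object ExistsIgnoreCase {
  def main(args: Array[String]): Unit = {
    val containsLevel = (arr: Seq[String]) => arr.exists(_.equalsIgnoreCase("level"))
    println(containsLevel(Seq("Lighter", "than", "Level"))) // true
    println(containsLevel(Seq("Renewable", "resource")))    // false
  }
}
```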
Hope this helps!
Upvotes: 2