Reputation: 382
I am new to Azure Spark/Databricks and am trying to access a specific row, e.g. the 10th row, of a DataFrame.
This is what I have done in the notebook so far:
1. Read a CSV file into a table
spark.read
.format("csv")
.option("header", "true")
.load("/mnt/training/enb/commonfiles/ramp.csv")
.write
.mode("overwrite")
.saveAsTable("ramp_csv")
2. Create a DataFrame for the "table" ramp_csv
val rampDF = spark.read.table("ramp_csv")
3. Read specific row
I am using the following logic in Scala
val myRow1st = rampDF.rdd.take(10).last
display(myRow1st)
It should display the 10th row, but I am getting the following error:
command-2264596624884586:9: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Command took 0.12 seconds --
Could you please tell me what I am missing here? I tried a few other things, but they didn't work. Thanks in advance for your help!
Upvotes: 2
Views: 4833
Reputation: 770
I would also go with João Guitana's answer. An alternative to get specifically the 10th record:
val df = (1 to 1000).toDF() // sample data
val tenth = df.limit(10).collect.toList.last
tenth: org.apache.spark.sql.Row = [10]
That will return the 10th Row of that df.
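Without an explicit sort, the rows that limit(10) returns are not guaranteed to follow a particular order, so it can help to define "10th" by a sort key first. A minimal sketch of one way to do that, using orderBy followed by RDD.zipWithIndex. It assumes a local SparkSession (in a Databricks notebook, spark already exists); sampleDF, n and nthRow are hypothetical names, not part of the question:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("nth-row-by-index").getOrCreate()
import spark.implicits._

// Hypothetical sample data, deliberately stored in descending order.
val sampleDF: DataFrame = (100 to 1 by -1).toDF("value")

val n = 10 // 1-based position of the row we want

// Sort so that the position is well defined, attach a 0-based index,
// and keep only the row whose index is n - 1.
val nthRow: Row = sampleDF
  .orderBy("value")
  .rdd
  .zipWithIndex()
  .filter { case (_, idx) => idx == n - 1 }
  .map { case (row, _) => row }
  .first()

println(nthRow)
```

Only the single matching Row is brought back to the driver here, at the cost of a sort and a pass over the data.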
Upvotes: 0
Reputation: 700
I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame
while avoiding a collect to the driver node (say, when N is very large), you can do:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = (1 to 100).toDF() // sample data
val cols = df.columns
df
.limit(10)
.withColumn("id", monotonically_increasing_id())
.agg(max(struct(("id" +: cols).map(col(_)):_*)).alias("tenth"))
.select(cols.map(c => col("tenth."+c).alias(c)):_*)
This will return:
+-----+
|value|
+-----+
| 10|
+-----+
Upvotes: 0
Reputation: 241
Here is the breakdown of what is happening in your code:
rampDF.rdd.take(10) returns Array[Row]
.last returns Row
display() takes a Dataset, and you are passing it a Row.
You can use .show(10) to display the first 10 rows in tabular form.
Another option is to do display(rampDF.limit(10))
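To make the type mismatch concrete, here is a minimal sketch that takes the 10th Row and wraps it back into a one-row DataFrame, which is a type display() accepts. It assumes a local SparkSession (in a Databricks notebook, spark already exists), and sampleDF is a hypothetical stand-in for rampDF. show() is used so the snippet also runs outside Databricks:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("row-to-dataframe").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for rampDF.
val sampleDF: DataFrame = (1 to 100).toDF("value")

// take(10) returns Array[Row]; .last is a single Row, not a Dataset.
val tenthRow: Row = sampleDF.take(10).last

// Wrap the Row in a one-row DataFrame that reuses the original schema.
val tenthDF: DataFrame =
  spark.createDataFrame(java.util.Collections.singletonList(tenthRow), sampleDF.schema)

tenthDF.show() // in a Databricks notebook, display(tenthDF) should work as well
```

Reusing sampleDF.schema keeps the column names and types of the original table, so the one-row result renders the same way as the full DataFrame would.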
Upvotes: 3