Sree Eedupuganti

Reputation: 440

Spark RDD Operations

Let's assume I have a table of two columns, A and B, in a CSV file. I pick the maximum value from column A [max value = 100] and need to return the corresponding value of column B [return value = AliExpress] using JavaRDD operations, without using DataFrames.

Input Table :

COLUMN A     Column B   
   56        Walmart
   72        Flipkart
   96        Amazon
   100       AliExpress

Output Table:

COLUMN A     Column B   
  100        AliExpress

This is what I have tried so far.

Source code:

SparkConf conf = new SparkConf().setAppName("SparkCSVReader").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> diskfile = sc.textFile("/Users/apple/Downloads/Crash_Data_1.csv");
// this keeps only the second column (index 1) of each line
JavaRDD<String> date = diskfile.flatMap(f -> Arrays.asList(f.split(",")[1]));

With the above code I can fetch only one column's data. Is there any way to get both columns? Any suggestions? Thanks in advance.
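
What I think I need is to keep both columns together as a pair instead of flatMapping to a single column, roughly like this (just a sketch of the idea, not tested, and assuming the file has no header row):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;

// map each line to a (COLUMN A, Column B) pair so both columns are kept
JavaPairRDD<Integer, String> pairs = diskfile.mapToPair(line -> {
    String[] cols = line.split(",");
    return new Tuple2<>(Integer.parseInt(cols[0].trim()), cols[1].trim());
});

From there I am not sure how to pick the pair with the maximum value of COLUMN A.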

Upvotes: 1

Views: 71

Answers (2)

Pawan B

Reputation: 4623

Data:

COLUMN_A,Column_B
56,Walmart
72,Flipkart
96,Amazon
100,AliExpress

Creating a DataFrame using Spark 2:

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sqlContext.read.option("header", "true")
                        .option("inferSchema", "true")
                        .csv("filelocation")

df.show

Using DataFrame functions

df.orderBy(desc("COLUMN_A")).take(1).foreach(println) 

OUTPUT:

[100,AliExpress]
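
Since the question uses Java, roughly the same DataFrame approach in the Java API would look like this (a sketch assuming Spark 2.x, with a SparkSession named spark as the entry point instead of sqlContext):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.desc;

// read the CSV with a header row and let Spark infer the column types
Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("filelocation");

// sort descending on COLUMN_A and keep only the first row
df.orderBy(desc("COLUMN_A")).limit(1).show();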

Using RDD functions

// key by COLUMN_A, sort descending, and take the first (key, value) pair
df.rdd
  .map(row => (row(0).toString.toInt, row(1)))
  .sortByKey(false)
  .take(1).foreach(println)

OUTPUT:

(100,AliExpress)
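
In the Java API, starting from a JavaPairRDD<Integer, String> built with mapToPair as sketched in the question, the same sortByKey approach would look roughly like this:

// sort descending on the key (COLUMN A) and print the first (key, value) pair
pairs.sortByKey(false)
     .take(1)
     .forEach(System.out::println);   // (100,AliExpress)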

Upvotes: 0

avr

Reputation: 4893

You can use either the top or the takeOrdered function to achieve this.

rdd.top(1)   // gives the largest element in the RDD (by the implicit tuple ordering, i.e. column A first)
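
In the Java API, top and takeOrdered need an explicit Comparator (Tuple2 does not implement java.lang.Comparable), and the comparator has to be Serializable because Spark ships it to the executors. A sketch, reusing the pairs RDD from the question:

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import scala.Tuple2;

// compare pairs by COLUMN A (the key); Serializable so Spark can send it to executors
class ColumnAComparator implements Comparator<Tuple2<Integer, String>>, Serializable {
    @Override
    public int compare(Tuple2<Integer, String> a, Tuple2<Integer, String> b) {
        return Integer.compare(a._1(), b._1());
    }
}

// top(1, comparator) returns the pair with the largest COLUMN A value
List<Tuple2<Integer, String>> result = pairs.top(1, new ColumnAComparator());
result.forEach(System.out::println);   // (100,AliExpress)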

Upvotes: 1
