Priyanka
Priyanka

Reputation: 51

Iterating in Scala DataFrame

I have a DataFrame in spark with Sample accounts which has 5 different columns.

val sampledf= sqlContext.sql(select * from Sampledf)

I have other table in oracle db which has millions of records. OracleTable

I want to filter Accounts present in OracleTable with respect to SampleDF

Select * from OracleTable where  column in (select column from SamplesDf)

I realized that in oracle we can not provide more than 1000 values in IN condition.

And below subquery query is not working. Due to huge data in OracleTable

I want to achieve below query

select column from OracleTable where (acctnum in (1,2,3,...1000) or acctnum in (1001,....2000) .... Basically all the accounts from SampleDF (every 1000 accounts)

Since we cant give more than 1000 at once (that's the limitation in Oracle) we can give 1000 every time.

How can I generate this kind of dynamic query. DO I need to create Array from Dataframe?

I just need a work around, how can I proceed. Any suggestions will be helpful.

Upvotes: 0

Views: 97

Answers (2)

Chandan Ray
Chandan Ray

Reputation: 2091

broadcast join is the best option which will broadcast the smaller dataframe across the cluster. As it’s mentioned the reading oracle data it’s taking time, it might be due to the profile restrictions of number of parallel sessions.

See below work around to build a dynamic in condition.

Val newsampledf = sampledf.withColumn(“seq”,row_number().over(Window.orderBy(“yourcolumn”)).select(“yourcolumn”, “seq”)

var i = 1L
var j = 0L
while(i <= (cnt/999))
{ var sql = newsampledf.select(“yourcolumn”).where(col(“seq” >= j).where(col(“seq”) <j + 999) j=j+999 i=i+1}

Upvotes: 1

Amar
Amar

Reputation: 278

You can try to join the both tables based on the column.

  1. Load the Oracle table as dataframe
  2. Join the oracleDF with sampleDF

    val resultDF=oracleDF.join(sampleDF,seq("column"))
    
  3. Use broadcast if sampleDF is small for better performance

    val resultDF=oracleDF.join(broadcast(sampleDF),seq("column"))
    

Hope it helps you.

Upvotes: 0

Related Questions