Manikandan Kannan

Reputation: 9004

Spark SQL with different data sources

Is it possible to create DataFrames from two different sources and perform operations on them?

For example,

df1 = <create from a file or folder from S3>
df2 = <create from a hive table>

df1.join(df2).where("df1Key" === "df2Key")

If this is possible, what are the implications in doing so?

Upvotes: 1

Views: 1438

Answers (2)

Avishek Bhattacharya

Reputation: 6964

A DataFrame is a source-independent abstraction. I would encourage you to read the original RDD paper and the wiki.

The abstraction keeps track of the data's location and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.

You can have a DataFrame from any source, but they are all homogenized to expose the same APIs. The DataFrame API provides a DataFrameReader interface that any underlying source can implement to create a DataFrame on top of it. The Cassandra connector for DataFrames is another example.
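As a minimal sketch of what that looks like in practice (the bucket, database, table, and keyspace names below are all hypothetical), the same reader API serves every source:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-source-example")
  .enableHiveSupport()  // required so spark.table can resolve Hive tables
  .getOrCreate()

// One DataFrameReader API, three different underlying sources:
val fromS3   = spark.read.parquet("s3a://my-bucket/some/path/")  // hypothetical S3 path
val fromHive = spark.table("my_db.my_table")                     // hypothetical Hive table
val fromCassandra = spark.read
  .format("org.apache.spark.sql.cassandra")                      // spark-cassandra-connector
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))    // hypothetical names
  .load()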

One caveat is that the speed of data retrieval may vary between sources. For example, if your data is in S3 versus HDFS, operations on the DataFrame created on top of HDFS will probably be faster. Nonetheless, you can perform any join on DataFrames created from different sources, as in the sketch below.
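Joining the S3-backed and Hive-backed DataFrames from the sketch above then works like any other join (the key column names here are assumptions):

// fromS3 and fromHive come from the sketch above; key names are assumed.
val joined = fromS3.join(fromHive, fromS3("df1Key") === fromHive("df2Key"))
joined.show()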

Upvotes: 1

undefined_variable

Reputation: 6218

Yes, it is possible to read from different data sources and perform operations on them. In fact, many applications require exactly this.

df1.join(df2).where("df1Key" === "df2Key")

This will do a Cartesian join and then apply a filter on top of it (note that plain string literals won't compile as a join condition in Scala; you need column expressions, and depending on the Spark version the Catalyst optimizer may rewrite the cross-join-plus-filter into an equi-join).

df1.join(df2,$"df1Key" === $"df2Key")

This should produce the same output, expressed as an explicit equi-join.
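One way to verify what actually happens (a sketch, assuming df1 and df2 are the question's DataFrames and that spark.implicits._ is in scope for the $ syntax) is to compare the physical plans with explain():

import spark.implicits._  // enables the $"colName" column syntax

// Explicit equi-join condition:
df1.join(df2, $"df1Key" === $"df2Key").explain()

// Unconditioned join followed by a filter; the plan shows whether the
// optimizer rewrote it into the same equi-join or left a CartesianProduct
// in place (this varies by Spark version):
df1.join(df2).where($"df1Key" === $"df2Key").explain()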

Upvotes: 1
