Reputation: 9004
Is it possible to create DataFrames from two different sources and perform operations on them?
For example,
df1 = <create from a file or folder from S3>
df2 = <create from a hive table>
df1.join(df2).where($"df1Key" === $"df2Key")
If this is possible, what are the implications of doing so?
Upvotes: 1
Views: 1438
Reputation: 6964
A DataFrame is a source-independent abstraction. I would encourage you to read the original paper on RDDs and the wiki.
The abstraction keeps track of the location of the data and the underlying DAG of operations. The DataFrame API adds a schema on top of an RDD.
You can create a DataFrame from any source, and they are all homogenized to expose the same API. The DataFrame API provides a DataFrameReader interface, which any underlying source can plug into in order to create a DataFrame on top of its data. Here is another example: the Cassandra connector for DataFrames.
One caveat is that the speed of data retrieval can vary between sources. For example, if some of your data is in S3 and some is in HDFS, operations on the DataFrame created on top of HDFS will probably be faster. Nonetheless, you will be able to perform any join on DataFrames created from different sources.
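To make the point about source independence concrete, here is a minimal sketch, assuming Spark 2.x or later with a SparkSession. The object name `MultiSourceJoin`, the helper `joinS3WithHive`, and its parameters are hypothetical and used only for illustration. The file format (Parquet) is also an assumption, and the key column names are taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiSourceJoin {
  // Hypothetical helper: s3Path and hiveTable are placeholders supplied by the caller.
  def joinS3WithHive(spark: SparkSession, s3Path: String, hiveTable: String): DataFrame = {
    // File-based source: DataFrameReader with an explicit format (Parquet is an assumption).
    val df1 = spark.read.format("parquet").load(s3Path)
    // Catalog source: for a Hive table the session must be built with enableHiveSupport().
    val df2 = spark.table(hiveTable)
    // Once loaded, both are ordinary DataFrames and can be joined like any others.
    df1.join(df2, df1("df1Key") === df2("df2Key"))
  }
}
```

Nothing in the join itself refers to where the data came from. Only the two read calls differ, which is also where any difference in retrieval speed would show up.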
Upvotes: 1
Reputation: 6218
Yes, it is possible to read from different data sources and perform operations on the result. In fact, many applications have this kind of requirement.
df1.join(df2).where($"df1Key" === $"df2Key")
Logically, this is a Cartesian join followed by a filter, although Spark's optimizer will typically push the equality predicate into the join.
df1.join(df2, $"df1Key" === $"df2Key")
This passes the condition to the join directly and should produce the same output.
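The two forms can be compared on a small example. This is a sketch only, assuming Spark 2.x or later running in local mode. The in-memory data, the object name `JoinForms`, and the extra columns `left` and `right` are made up for illustration. Calling `explain()` prints the physical plan for each form so you can check how your Spark version executes them.

```scala
import org.apache.spark.sql.SparkSession

object JoinForms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-forms").getOrCreate()
    import spark.implicits._ // enables toDF and the $"col" syntax

    // Small in-memory stand-ins for the two sources.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("df1Key", "left")
    val df2 = Seq((2, "x"), (3, "y"), (4, "z")).toDF("df2Key", "right")

    // Form 1: join without a condition, then filter.
    val filtered = df1.join(df2).where($"df1Key" === $"df2Key")
    // Form 2: condition passed to the join itself.
    val joined = df1.join(df2, $"df1Key" === $"df2Key")

    // Print the physical plans for comparison.
    filtered.explain()
    joined.explain()

    // Both forms should contain exactly the same rows.
    assert(filtered.except(joined).count() == 0)
    assert(joined.except(filtered).count() == 0)

    spark.stop()
  }
}
```

Passing the condition to `join` states the intent explicitly and does not rely on the optimizer to rewrite the plan.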
Upvotes: 1