Reputation: 53
I have some CSV files in my data lake which are updated quite frequently by another process. Ideally I would like to be able to query these files through Spark SQL, without having to run an equally frequent batch process to load all the new files into a Spark table.
Looking at the documentation, I'm unsure whether this is possible, as all the examples show views that query existing tables or other views, rather than loose files stored in a data lake.
Upvotes: 0
Views: 688
Reputation: 380
You can do something like this if your CSV files are in S3 under the location s3://bucket/folder:
// Hive-style DDL over the CSV files in the S3 folder;
// requires a Hive-enabled SparkSession (enableHiveSupport)
spark.sql(
  """
  CREATE TABLE test2
  (a string, b string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
  LOCATION 's3://bucket/folder'
  """
)
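If your SparkSession is not built with Hive support, a roughly equivalent sketch using Spark's built-in CSV data source should also work (the table name test2_csv is just a placeholder):

// Native Spark data source table over the same folder; no Hive serde needed
spark.sql(
  """
  CREATE TABLE test2_csv
  (a string, b string)
  USING CSV
  OPTIONS (sep ',')
  LOCATION 's3://bucket/folder'
  """
)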
You have to adapt the fields and the field separator, though. To test it, you can first run:
// toDF requires `import spark.implicits._` outside of spark-shell
Seq(("1", "a"), ("2", "b"), ("3", "a"), ("4", "b"))
  .toDF("num", "char")
  .repartition(1)
  .write.mode("overwrite")
  .csv("s3://bucket/folder")
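Once the test data is written (and assuming the table was created over the same folder), a quick sanity check could look like this. REFRESH TABLE is worth knowing about here because Spark may cache the file listing, so files added to the folder after the first query can otherwise be missed:

// Query the table; the rows written above should show up
spark.sql("SELECT * FROM test2").show()

// If more files land in the folder later, refresh the cached file listing before querying again
spark.sql("REFRESH TABLE test2")
spark.sql("SELECT count(*) FROM test2").show()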
Upvotes: 1