Reputation: 75
I am new to big data development. I have a use case to read data from HDFS, process it with Spark, and save the results to a MySQL DB. The reason for saving to MySQL is that the reporting tool points to MySQL. I have worked out the flow below to achieve this. Could anyone verify it and suggest whether any optimizations/changes are needed?
import org.apache.spark.sql.SaveMode
import spark.implicits._  // for the $"..." column syntax

val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .option("mode", "FAILFAST")
  .load("hdfs://localhost:9000/user/testuser/samples.csv")

val resultsdf = df.select("Sample", "p16", "Age", "Race").filter($"Anatomy".like("BOT"))

val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.cj.jdbc.Driver")
prop.setProperty("user", "root")
prop.setProperty("password", "pw")
val url = "jdbc:mysql://localhost:3306/meta"

df.write.mode(SaveMode.Append).jdbc(url, "sample_metrics", prop)
Upvotes: 0
Views: 218
Reputation: 5487
A change is required in the line val resultsdf = ...: you are using the column Anatomy for filtering, but you didn't select that column in the select clause. Add that column, otherwise you will end up with an error, an AnalysisException (unable to resolve column Anatomy):
val resultsdf = df.select("Sample","p16","Age","Race", "Anatomy").filter($"Anatomy".like("BOT"))
Optimizations:
You can use additional JDBC properties such as numPartitions (the maximum number of partitions, and hence parallel JDBC connections, used for the write) and batchsize (the number of rows per insert batch; the default is 1000).
You can read about these properties here.
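As an illustration, the write from the question could be tuned like this (a sketch reusing the question's resultsdf, url, and prop; the numeric values are made-up examples to tune against your data volume and what your MySQL server can handle):

// Sketch: same JDBC write as in the question, with tuning options added.
// numPartitions caps the number of parallel JDBC connections;
// batchsize sets the rows per INSERT batch (Spark's default is 1000).
resultsdf.write
  .mode(SaveMode.Append)
  .option("numPartitions", "8")    // illustrative value only
  .option("batchsize", "10000")    // illustrative value only
  .jdbc(url, "sample_metrics", prop)

Note that too many parallel connections can overwhelm a single MySQL instance, so it is usually better to start small and increase numPartitions while watching the database.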
Upvotes: 1