Reputation: 75
I am new to big data development. I have a use case to read data from HDFS, process it with Spark, and save the results to a MySQL DB. The reason for saving to MySQL is that the reporting tool points to MySQL. I have worked out the flow below to achieve this. Could anyone verify it and suggest whether any optimizations/changes are needed?
import org.apache.spark.sql.SaveMode
import spark.implicits._  // for the $"..." column syntax

val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .option("mode", "FAILFAST")
  .load("hdfs://localhost:9000/user/testuser/samples.csv")

val resultsdf = df.select("Sample", "p16", "Age", "Race").filter($"Anatomy".like("BOT"))

val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.cj.jdbc.Driver")
prop.setProperty("user", "root")
prop.setProperty("password", "pw")
val url = "jdbc:mysql://localhost:3306/meta"

df.write.mode(SaveMode.Append).jdbc(url, "sample_metrics", prop)
Upvotes: 0
Views: 218
Reputation: 5487
A change is required in the line val resultsdf = ...: you are using the column Anatomy for filtering, but you didn't select that column in the select clause. Add that column, otherwise you will end up with an error, an AnalysisException (unable to resolve column Anatomy):
val resultsdf = df.select("Sample","p16","Age","Race", "Anatomy").filter($"Anatomy".like("BOT"))
Optimizations:
You can use additional JDBC properties such as numPartitions (the maximum number of partitions, and hence parallel JDBC connections, used for the write) and batchsize (the number of rows per insert batch; the default is 1000).
You can read about these properties here.
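As an illustration, the write from the question could be tuned like this (a sketch reusing the question's resultsdf, url, and prop; the numeric values are made-up examples to tune against your data volume and what your MySQL server can handle):

// Sketch: same JDBC write as in the question, with tuning options added.
// numPartitions caps the number of parallel JDBC connections;
// batchsize sets the rows per INSERT batch (Spark's default is 1000).
resultsdf.write
  .mode(SaveMode.Append)
  .option("numPartitions", "8")    // illustrative value only
  .option("batchsize", "10000")    // illustrative value only
  .jdbc(url, "sample_metrics", prop)

Note that too many parallel connections can overwhelm a single MySQL instance, so it is usually better to start small and increase numPartitions while watching the database.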
Upvotes: 1