Reputation: 101
I am a beginner in Apache Spark and Spark Streaming programming. I connected Azure Data Lake to Apache Spark using the Hadoop connector link. It is connected properly, and I can access the ADL data from the terminal itself using
hadoop fs -ls adl://xxxxx.azuredatalakestore.net
which shows the list of files the directory contains. But I need to get the same list from a program, so I tried this:
SparkConf conf = new SparkConf().setAppName("ADL Application");
JavaSparkContext jsc = new JavaSparkContext(conf);
@SuppressWarnings("resource")
JavaStreamingContext jcntxt = new JavaStreamingContext(jsc, Durations.seconds(1));
JavaDStream<String> javaDStream = jcntxt.textFileStream("adl://xxxxx.azuredatalakestore.net/directory");
JavaEsSparkStreaming.saveJsonToEs(javaDStream, "modwebservice/docs");
jcntxt.start();
jcntxt.awaitTermination();
But it does not show any list. I also tried wholeTextFiles instead of textFileStream, but that did not work either. However, if I use a file name instead of a directory, it works properly and I can get the data:
JavaSparkContext jsc = new JavaSparkContext(conf);
@SuppressWarnings("resource")
JavaStreamingContext jcntxt = new JavaStreamingContext(jsc, Durations.seconds(1));
JavaRDD<String> javaRDD6 = jsc.textFile(args[0], 1);
Queue<JavaRDD<String>> microbatches = new LinkedList<JavaRDD<String>>();
microbatches.add(javaRDD6);
JavaDStream<String> javaDStream = jcntxt.queueStream(microbatches);
JavaEsSparkStreaming.saveJsonToEs(javaDStream, args[1]);
jcntxt.start();
jcntxt.awaitTermination();
As far as I know, Apache Spark might only work with local storage and HDFS file locations. I don't know how to get the list of files an Azure Data Lake directory contains; I tried these two ways, but neither worked. If anyone knows, please share a link. Thank you.
Upvotes: 2
Views: 849
Reputation: 24148
Try the Java code below to list all files on HDFS with Data Lake Store via the wholeTextFiles method of the JavaSparkContext class (you can also use the same method on SparkContext).
JavaSparkContext jsc = new JavaSparkContext();
String path = "adl://xxxxx.azuredatalakestore.net";
JavaPairRDD<String, String> jprdd = jsc.wholeTextFiles(path);
for (Tuple2<String, String> tuple : jprdd.collect()) { // Tuple2: <FileName, Content>
    System.out.println(tuple._1());
}
Hope it helps.
Upvotes: 1