richard a

Reputation: 101

SparkContext wholeTextFiles and JavaStreamingContext textFileStream not working in an Apache Spark cluster

I am a beginner in Apache Spark and Spark Streaming programming. I connected Azure Data Lake to Apache Spark using the Hadoop connector. It connects properly and I can access the ADL data from the terminal itself using

hadoop fs -ls adl://xxxxx.azuredatalakestore.net

It shows the list of file names the directory contains, but I need to get the same list from my program. I tried this:

SparkConf conf = new SparkConf().setAppName("ADL Application");
JavaSparkContext jsc = new JavaSparkContext(conf);
@SuppressWarnings("resource")
JavaStreamingContext jcntxt = new JavaStreamingContext(jsc, Durations.seconds(1));
JavaDStream<String> javaDStream = jcntxt.textFileStream("adl://xxxxx.azuredatalakestore.net/directory");
JavaEsSparkStreaming.saveJsonToEs(javaDStream, "modwebservice/docs");
jcntxt.start();
jcntxt.awaitTermination();

But it does not show any list. I also tried wholeTextFiles instead of textFileStream, but that does not work either. However, if I use a file name instead of a directory, it works properly and I can get the data:

JavaSparkContext jsc = new JavaSparkContext(conf);
@SuppressWarnings("resource")
JavaStreamingContext jcntxt = new JavaStreamingContext(jsc, Durations.seconds(1));
JavaRDD<String> javaRDD6 = jsc.textFile(args[0], 1);
Queue<JavaRDD<String>> microbatches = new LinkedList<JavaRDD<String>>();
microbatches.add(javaRDD6);
JavaDStream<String> javaDStream = jcntxt.queueStream(microbatches);
JavaEsSparkStreaming.saveJsonToEs(javaDStream, args[1]);
jcntxt.start();
jcntxt.awaitTermination();

As far as I know, Apache Spark might only work with local storage and HDFS file locations. I don't know how to get the list of files an Azure Data Lake directory contains; I tried the two ways above but neither works. If anyone knows, please share a link. Thank you.

Upvotes: 2

Views: 849

Answers (1)

Peter Pan

Reputation: 24148

Try the code below in Java to list all files on Data Lake Store via the wholeTextFiles method of the JavaSparkContext class (SparkContext offers the same method).

JavaSparkContext jsc = new JavaSparkContext();
String path = "adl://xxxxx.azuredatalakestore.net";
JavaPairRDD<String, String> jprdd = jsc.wholeTextFiles(path);
for (Tuple2<String, String> tuple : jprdd.collect()) {  // Tuple2: <FileName, Content>
    System.out.println(tuple._1());
}
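Note that wholeTextFiles reads the contents of every file into memory just to get the names. If you only need the directory listing, a lighter option may be the Hadoop FileSystem API directly, which returns file metadata without reading the data. This is a sketch, assuming the azure-datalake-store Hadoop connector and its credentials are already configured (as in your working `hadoop fs -ls` test); the class name and `/directory` path are illustrative:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAdlFiles {
    public static void main(String[] args) throws Exception {
        // Picks up the ADL connector settings from core-site.xml,
        // the same configuration the hadoop CLI uses.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(
                URI.create("adl://xxxxx.azuredatalakestore.net"), conf);
        // listStatus returns metadata only; no file contents are read
        for (FileStatus status : fs.listStatus(new Path("/directory"))) {
            System.out.println(status.getPath().getName());
        }
    }
}
```

You could then pass each path to sc.textFile (or build the queueStream microbatches from it) instead of streaming the whole directory.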

Hope it helps.

Upvotes: 1
