KrunalParmar

Reputation: 1114

Create a Spark RDD or DataFrame from Strings coming from an InputStream in Java

I have a stream of Strings in Java that comes from a CSV file on another machine. I create an InputStream and read the CSV file line by line through a BufferedReader, as follows.

        // Call a method that returns the InputStream of the remote file.
        InputStream stream = getInputStreamOfFile();

        BufferedReader lineStream = new BufferedReader(new InputStreamReader(stream));

        // Holds the line currently being read.
        String inputLine;
        while ((inputLine = lineStream.readLine()) != null) {
            System.out.println("******************new Line***********");
            System.out.println(inputLine);
        }
        lineStream.close();
        stream.close();

Now I want to create a Spark RDD or DataFrame from these lines.

One solution is to create a new RDD for each line, maintain a global RDD, and keep taking the union of the two. Is there a better solution?

Note: this file is not on the same machine. It comes from remote storage, and I have the HTTP URL of the file.

Upvotes: 1

Views: 3177

Answers (1)

ForeverLearner

Reputation: 2103

If the contents of the InputStream fit in memory, we can use the following:

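import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

private static List<String> displayTextInputStream(InputStream input) throws IOException {
    // Read the text input stream one line at a time and collect each line.
    List<String> result = new ArrayList<>();
    // try-with-resources closes the reader (and the wrapped stream) when done.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(input))) {
        String line;
        while ((line = reader.readLine()) != null) {
            result.add(line);
        }
    }
    return result;
}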

Now we can convert the List<String> to the corresponding RDD.

S3Object fullObject = s3Client.getObject(new GetObjectRequest("bigdataanalytics", each.getKey()));
List<String> listVals = displayTextInputStream(fullObject.getObjectContent());
JavaRDD<String> s3Rdd = sc.parallelize(listVals);
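
If the contents do not fit comfortably in memory, one possible alternative is to spool the stream to a file first and let Spark read that file instead of a List. The following is only a minimal sketch using the JDK alone: the class name StreamSpool, the helper spoolToTempFile, and the sample CSV text are hypothetical, and the ByteArrayInputStream merely stands in for the remote stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

class StreamSpool {

    // Hypothetical helper: copies the whole stream into a temporary file
    // without holding its contents in memory, and returns the file's path.
    static Path spoolToTempFile(InputStream input) throws IOException {
        Path tmp = Files.createTempFile("remote-csv-", ".csv");
        // try-with-resources closes the input stream once the copy finishes.
        try (InputStream in = input) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote stream (assumed sample data, not from the question).
        InputStream stream = new ByteArrayInputStream(
                "id,name\n1,a\n2,b\n".getBytes(StandardCharsets.UTF_8));

        Path csv = spoolToTempFile(stream);

        // Read the file back only to show that every line arrived.
        List<String> lines = Files.readAllLines(csv, StandardCharsets.UTF_8);
        System.out.println(lines.size() + " lines written to " + csv);

        // Remove the temporary file once it is no longer needed.
        Files.delete(csv);
    }
}
```

Before the file is deleted, its path could be handed to Spark, for example spark.read().csv(csv.toString()), assuming a SparkSession named spark. As written this only works when the driver and the executors see the same filesystem (local mode, for instance); on a cluster the file would have to be placed on shared storage such as HDFS.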

Upvotes: 1
