Hellen

Reputation: 3532

How to use Hadoop InputFormats in Apache Spark?

I have a class ImageInputFormat in Hadoop which reads images from HDFS. How can I use my InputFormat in Spark?

Here is my ImageInputFormat:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ImageInputFormat extends FileInputFormat<Text, ImageWritable> {

    // Hand each split to the custom reader that decodes the image bytes.
    @Override
    public ImageRecordReader createRecordReader(InputSplit split,
                  TaskAttemptContext context) throws IOException, InterruptedException {
        return new ImageRecordReader();
    }

    // Image files must be read as a whole, so they are never split.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}

Upvotes: 11

Views: 11045

Answers (2)

vijay kumar

Reputation: 2049

Will the images all be stored in a HadoopRDD?

Yes, everything loaded into Spark is stored as RDDs.

Can I set the RDD capacity so that when the RDD is full, the rest of the data is stored on disk?

The default storage level in Spark is StorageLevel.MEMORY_ONLY. Use MEMORY_ONLY_SER, which is more space-efficient; for spilling overflow to disk, the MEMORY_AND_DISK levels do exactly that. Please refer to the Spark documentation > Scala programming guide > RDD Persistence.
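
A minimal sketch in Java of persisting an RDD with that storage level; the input path and data here are hypothetical:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("PersistExample"));

        // Hypothetical input path.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/data");

        // MEMORY_ONLY_SER keeps partitions in memory as serialized bytes,
        // which is more compact than the default MEMORY_ONLY; use
        // MEMORY_AND_DISK_SER instead if overflow should spill to disk.
        lines.persist(StorageLevel.MEMORY_ONLY_SER());

        System.out.println(lines.count());
        sc.stop();
    }
}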

Will performance be affected if the data is too big?

Yes, as the data size increases it will affect performance as well.

Upvotes: 2

Robert Metzger

Reputation: 4542

The SparkContext has a method called hadoopFile. It accepts classes implementing the interface org.apache.hadoop.mapred.InputFormat.

Its description says "Get an RDD for a Hadoop file with an arbitrary InputFormat". Note that your ImageInputFormat extends the new org.apache.hadoop.mapreduce.lib.input.FileInputFormat, so the matching method for it is newAPIHadoopFile, which accepts InputFormats from the org.apache.hadoop.mapreduce API.
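
For example, a minimal sketch in Java, assuming the ImageInputFormat and ImageWritable classes above are on the classpath (the HDFS path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ImageLoader {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ImageLoader"));

        // newAPIHadoopFile matches InputFormats from the new
        // org.apache.hadoop.mapreduce API, like ImageInputFormat above.
        JavaPairRDD<Text, ImageWritable> images = sc.newAPIHadoopFile(
                "hdfs:///path/to/images",   // hypothetical input path
                ImageInputFormat.class,
                Text.class,                 // key type from ImageRecordReader
                ImageWritable.class,        // value type: the image itself
                new Configuration());

        System.out.println("Loaded " + images.count() + " images");
        sc.stop();
    }
}

The result is a pair RDD of (Text, ImageWritable) records that can be processed with the usual RDD transformations.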

Also have a look at the Spark Documentation.

Upvotes: 14
