What is the best practice for downloading large CSV files from S3 in Java?

Question

I'm trying to read a large gzipped CSV file from S3, but the download fails with “java.net.SocketException: Connection reset”. The likely cause is that the InputStream stays open for too long: the download often takes more than an hour because I run several time-consuming processing steps on the content as it is streamed. This is how I currently parse the file:

// Stream the gzipped object straight from S3 and parse it record by record.
InputStream inputStream = new GZIPInputStream(s3Client.getObject("bucket", "key").getObjectContent());
Reader decoder = new InputStreamReader(inputStream, Charset.defaultCharset());
BufferedReader reader = new BufferedReader(decoder);
CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT);
// The S3 connection stays open for as long as records are pulled from this iterator.
CSVRecord nextRecord = csvParser.iterator().next();
...

I know I have to split the download into multiple short getObject calls, each with a defined byte offset in the GetObjectRequest, but I'm wondering how to choose this offset for a CSV file, since I need complete lines.

Do I have to ditch the parser library and parse each line into an object myself, so that I can keep a count of the bytes read and use it as the offset for the next batch? That doesn't seem very robust to me. Is there a best-practice way to achieve "batch downloading" of CSV records?
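
For illustration, here is a minimal sketch of the batching idea described above, not a definitive implementation. It fetches fixed-size byte ranges and carries the incomplete trailing line of each chunk over into the next one, so that only complete lines are handed on and no byte counting inside the parser is needed. The names RangedLineReader, RangeFetcher and readLines are hypothetical. With the AWS SDK for Java v1, each fetch would presumably be a GetObjectRequest with withRange(start, end); here the fetcher is simulated with an in-memory byte array so the example runs on its own. It also assumes that the object is stored uncompressed (a gzip stream cannot be decompressed starting from an arbitrary offset), that the text is UTF-8, and that no quoted CSV field contains a line break.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

final class RangedLineReader {

    /** Hypothetical abstraction over one ranged GET; both offsets are inclusive. */
    interface RangeFetcher {
        byte[] fetch(long start, long endInclusive) throws Exception;
    }

    /** Reads the object chunk by chunk and passes each complete line to lineHandler. */
    static void readLines(RangeFetcher fetcher, long totalSize, int chunkSize,
                          Consumer<String> lineHandler) throws Exception {
        // Holds the bytes of a line that started in an earlier chunk and is not finished yet.
        ByteArrayOutputStream carry = new ByteArrayOutputStream();
        for (long offset = 0; offset < totalSize; offset += chunkSize) {
            long end = Math.min(offset + chunkSize, totalSize) - 1;
            byte[] chunk = fetcher.fetch(offset, end); // one short-lived request per chunk
            int lineStart = 0;
            for (int i = 0; i < chunk.length; i++) {
                // Splitting on the raw '\n' byte is safe in UTF-8: it never occurs
                // inside a multi-byte character.
                if (chunk[i] == '\n') {
                    carry.write(chunk, lineStart, i - lineStart);
                    emit(carry, lineHandler);
                    lineStart = i + 1;
                }
            }
            // Keep the unfinished tail for the next chunk.
            carry.write(chunk, lineStart, chunk.length - lineStart);
        }
        if (carry.size() > 0) {
            emit(carry, lineHandler); // last line without a trailing newline
        }
    }

    private static void emit(ByteArrayOutputStream buffer, Consumer<String> handler) {
        String line = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        buffer.reset();
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1); // tolerate CRLF line endings
        }
        handler.accept(line);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "id,name\n1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8);
        // Stand-in for S3: returns the requested inclusive byte range of the array.
        RangeFetcher inMemory = (start, end) -> Arrays.copyOfRange(data, (int) start, (int) end + 1);
        List<String> lines = new ArrayList<>();
        readLines(inMemory, data.length, 5, lines::add);
        System.out.println(lines);
    }
}
```

Each complete line could then be handed to the CSV library one at a time (for example via CSVParser.parse(String, CSVFormat)), which keeps the parser in use while every individual S3 request stays short.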


