Mengyang

Reputation: 51

Out of Memory issue (Heap) from generating large csv file

I have an application that lets users fetch data from a database and download it as a CSV file.

The general workflow is as follows:

  1. The user clicks the download button in the frontend.
  2. The backend (Spring Boot in this case) starts an async thread to get the data from the database.
  3. It generates CSV files from the data in step (2) and uploads them to Google Cloud Storage.
  4. It sends the user an email with a signed URL to download the data.

My problem is that the backend keeps throwing an "OOM Java heap space" error in some extreme cases. In the extreme case, all of my memory (4GB) was filled. My initial plan was to load the data from the database page by page (not all at once, to save memory) and generate one CSV per page. That way, GC should reclaim the memory once a CSV has been generated, keeping overall memory usage low. In practice, however, memory keeps increasing until it is all used up. GC does not work as expected. In the extreme case I have 18 pages in total and around 200,000 records (from the DB) per page.

I used JProfiler to monitor heap usage and found that the retained size of those large byte[] objects is not 0, which may mean that some references still point to them (I guess that's why GC does not clear them from memory as expected).

How should I optimize my code and JVM settings so that memory usage stays below 1GB in the extreme case? What keeps those large byte[] objects from being cleared by GC as expected?

The code to get data from the database and generate the CSV file:

@Override
    @Async
    @Transactional(timeout = DOWNLOAD_DATA_TRANSACTION_TIME_LIMIT)
    public void startDownloadDataInCSVBySearchQuery(SearchQuery query, DownloadRequestRecord downloadRecord) throws IOException {
        logger.debug(Thread.currentThread().getName() + ": starts to process download data");
        String username = downloadRecord.getUsername();
        // get posts from database first
        List<? extends SocialPost> posts = this.postsService.getPosts(query);
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            // get ids of posts
            List<String> postsIDs = this.getPostsIDsFromPosts(posts);
            int postsSize = postsIDs.size();
            // do pagination db search. For each page, there are 1500 posts
            int numPages = postsSize / POSTS_COUNT_PER_PAGE + 1;
            for (int i = 0; i < numPages; i++) {
                logger.debug("Download comments: start at page {}, out of total page {}", i + 1, numPages);
                int pageStartPos = i * POSTS_COUNT_PER_PAGE; // this is set to 1500
                int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
                // get post ids per page
                List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
                // use posts ids to get corresponding comments from db, via sql "IN"
                List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
                // generate csv file for page data and upload to google cloud
                String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";
                this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, out);
                this.googleCloudStorageInstance.uploadDownloadOutputStreamData(out.toByteArray(), commentsFileName);
            }
        } catch (Exception ex) {
            logger.error("Exception from downloading data: ", ex);
        }
    }

Code to generate csv file

// use Apache csv 
public void generateCommentsCsvFileStream(List<Comment> comments, String filename, ByteArrayOutputStream out) throws IOException {
        CSVPrinter csvPrinter = new CSVPrinter(new OutputStreamWriter(out), CSVFormat.DEFAULT.withHeader(PostHeaders.class).withQuoteMode(QuoteMode.MINIMAL));
        for (Comment comment: comments) {
            List<Object> record = Arrays.asList(
// write csv content
                    comment.getPageId(),
                    ...
            );
            csvPrinter.printRecord(record);
        }
        // close printer to release memory
        csvPrinter.flush();
        csvPrinter.close();
    }

Code to upload the file to Google Cloud Storage

    public Blob uploadDownloadOutputStreamData(byte[] fileStream, String filename) {
        logger.debug("Upload file: '{}' to google cloud storage", filename);
        BlobId blobId = BlobId.of(this.DownloadDataBucketName, filename);
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
        return this.cloudStorage.create(blobInfo, fileStream);
    }

The heap usage keeps increasing as the page number increases. The G1 old gen heap usage is still very high after the system crashes.


The G1 Eden space is almost empty; big files are saved into the old gen directly.

Old gen GC activity is low; most GC activity comes from the Eden space.


The heap walker shows that the retained size of those big byte[] objects is not 0.


Upvotes: 1

Views: 2366

Answers (1)

Renato

Reputation: 13690

You're using a single instance of ByteArrayOutputStream, which just writes to an in-memory byte array.

That looks like a mistake, because you seem to want to upload one page at a time, not the accumulated result so far (which includes ALL previous pages).

By the way, doing this is useless:

try (ByteArrayOutputStream out = new ByteArrayOutputStream())

ByteArrayOutputStream does not need to be closed because it lives entirely in memory (its close() is a no-op). Just remove that, and create a new instance for each page (inside the page for loop) instead of reusing the same instance for all pages. That might just work fine.
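To illustrate the difference, here is a minimal sketch using only the JDK. The class name, writePage() and the two uploadSizes...() helpers are hypothetical stand-ins for the CSV generation and upload code in the question; instead of uploading, they just record how many bytes each page's upload would contain.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PerPageBufferDemo {

    // Hypothetical stand-in for generateCommentsCsvFileStream():
    // writes one line per row into the given in-memory stream.
    static void writePage(List<String> rows, ByteArrayOutputStream out) throws IOException {
        // Closing the writer also calls out.close(), which is a no-op for ByteArrayOutputStream.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            for (String row : rows) {
                writer.write(row);
                writer.write('\n');
            }
        }
    }

    // Mirrors the question: one buffer shared by all pages,
    // so every "upload" also contains all earlier pages.
    static List<Integer> uploadSizesWithSharedBuffer(List<List<String>> pages) throws IOException {
        List<Integer> sizes = new ArrayList<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (List<String> page : pages) {
            writePage(page, out);
            sizes.add(out.toByteArray().length);
        }
        return sizes;
    }

    // Suggested change: a new buffer per page, which becomes
    // unreachable (and collectable) once the iteration ends.
    static List<Integer> uploadSizesWithFreshBuffer(List<List<String>> pages) throws IOException {
        List<Integer> sizes = new ArrayList<>();
        for (List<String> page : pages) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            writePage(page, out);
            sizes.add(out.toByteArray().length);
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("a,1", "b,2"),
                Arrays.asList("c,3", "d,4"),
                Arrays.asList("e,5", "f,6"));
        System.out.println("shared: " + uploadSizesWithSharedBuffer(pages)); // shared: [8, 16, 24]
        System.out.println("fresh:  " + uploadSizesWithFreshBuffer(pages));  // fresh:  [8, 8, 8]
    }
}
```

With the shared buffer, the bytes handed to each upload keep growing, and toByteArray() copies the whole accumulated buffer every time. With a fresh buffer, each upload holds only its own page.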

EDIT

Another piece of advice would be to break this code up into more methods... not just because smaller methods are more readable, but because you're keeping temporary variables in scope for too long (causing memory to stick around longer than needed).

For example:

List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
   // get ids of posts
   List<String> postsIDs = this.getPostsIDsFromPosts(posts);
   ....

From this point on, posts is not used anymore, and I assume that it contains a lot of stuff... so you should "drop" that variable once you've got the IDs.

Do something like this instead:

List<String> postsIDs = getAllPostIds(query);

....
List<String> getAllPostIds(SearchQuery query) {
    // this variable will be GC'd after this method returns as it's no longer referenced (assuming getPostsIDsFromPosts() doesn't store it in a field)
    List<? extends SocialPost> posts = this.postsService.getPosts(query);
    return this.getPostsIDsFromPosts(posts);
}

Upvotes: 1
