Efficient way of processing a large CSV file using Java

Let's consider a scenario

  1. Accounts.csv
  2. Transaction.csv

We have a mapping of each account number to its transaction details, so one account number can have multiple transactions. Using these details, we have to generate a PDF for each account.

If the transaction CSV file is very large (>1 GB), then loading and parsing all the details at once could cause memory issues. So what would be the best approach to parse the transaction file? Reading it chunk by chunk also leads to high memory consumption. Please advise.

Upvotes: 1

Views: 1557

Answers (3)

Bruce Martin
Bruce Martin

Reputation: 10543

As others have said, a database would be a good solution.

Alternatively, you could sort the two files on the account number. Most operating systems provide efficient file-sorting programs, e.g. on Linux (sorting on the 5th column):

LC_ALL=C sort -t, -k5 file.csv > sorted.csv

taken from Sorting csv file by 5th column using bash

You can then read the two files sequentially.

Your programming logic is:

if (Accounts.accountNumber < Transaction.accountNumber) {
    read Accounts file
} else if (Accounts.accountNumber == Transaction.accountNumber) {
    process transaction
    read Transaction file
} else {
    read Transaction file
}

The memory requirements will be tiny: you only need to hold one record from each file in memory at a time.
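The logic above can be sketched in Java roughly as follows. This assumes both files are already sorted by account number and that the account number is the first comma-separated field of every record (a hypothetical layout; adapt the `key` method to your actual columns):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class MergeJoin {

    // Extracts the join key; assumes the account number is the first CSV field.
    static String key(String csvLine) {
        return csvLine.split(",", 2)[0];
    }

    // Walks both sorted inputs in lock-step; only one record from each
    // file is held in memory at any time.
    static List<String> join(BufferedReader accounts, BufferedReader transactions) throws Exception {
        List<String> matched = new ArrayList<>();
        String acc = accounts.readLine();
        String txn = transactions.readLine();
        while (acc != null && txn != null) {
            int cmp = key(acc).compareTo(key(txn));
            if (cmp < 0) {
                acc = accounts.readLine();        // this account has no more transactions
            } else if (cmp == 0) {
                matched.add(acc + " -> " + txn);  // process the transaction
                txn = transactions.readLine();
            } else {
                txn = transactions.readLine();    // transaction with no matching account
            }
        }
        return matched;
    }

    public static void main(String[] args) throws Exception {
        // StringReaders stand in for FileReaders over the sorted CSV files.
        BufferedReader accounts = new BufferedReader(new StringReader("A1,Alice\nA2,Bob\n"));
        BufferedReader txns = new BufferedReader(new StringReader("A1,10.00\nA1,25.50\nA2,5.00\n"));
        join(accounts, txns).forEach(System.out::println);
    }
}
```

In a real run you would wrap `new FileReader(...)` instead of `StringReader`, and instead of collecting matches into a list you would stream each account's transactions straight into its PDF generator.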

Upvotes: 1

Thysce
Thysce

Reputation: 116

Of course, importing the data into a database first would be the most elegant way. That said, your question leaves the impression that this isn't an option.

So I recommend you read transactions.csv line by line (for instance with a BufferedReader). Because in CSV format each line is a record, you can, while reading, filter out and forget every record that does not belong to your current account. After one file traversal you have all transactions for one account, and that should usually fit into memory. A downside of this approach is that you end up reading the transaction file multiple times, once per account's PDF generation. But if your application needed to be highly optimized, I assume you would already have used a database.
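A minimal sketch of that one-account-at-a-time scan, assuming (hypothetically) that the account number is the first comma-separated field of each record:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TransactionFilter {

    // Streams the file once, keeping only the lines for the requested
    // account; every other record is read and immediately discarded.
    static List<String> transactionsFor(BufferedReader transactions, String accountNumber) throws Exception {
        List<String> result = new ArrayList<>();
        String line;
        while ((line = transactions.readLine()) != null) {
            if (line.startsWith(accountNumber + ",")) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // A StringReader stands in for a FileReader over transactions.csv.
        BufferedReader txns = new BufferedReader(new StringReader("A1,10.00\nA2,5.00\nA1,25.50\n"));
        transactionsFor(txns, "A1").forEach(System.out::println);
    }
}
```

You would call `transactionsFor` once per account (reopening the file each time), generate that account's PDF, and let the list be garbage-collected before moving on.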

Upvotes: 0

Linus
Linus

Reputation: 950

Let's say you are using Oracle as the database. You could load the data into the corresponding tables using the Oracle SQL*Loader tool.

Once the data is loaded, you can use simple SQL queries to join and query data from the loaded tables.

This will work with all types of databases, but you will need to find the appropriate bulk-loading tool for yours.
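Once the tables exist, the per-account processing reduces to iterating a joined result set via JDBC. The connection URL, credentials, and the table/column names below are assumptions; adapt them to whatever schema your loader created (this sketch needs a running database to actually execute):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AccountReport {

    // Hypothetical schema: accounts(account_number, ...) and
    // transactions(account_number, amount, ...).
    static final String JOIN_SQL =
        "SELECT a.account_number, t.amount " +
        "FROM accounts a JOIN transactions t " +
        "ON a.account_number = t.account_number " +
        "ORDER BY a.account_number";

    public static void main(String[] args) throws Exception {
        // Example Oracle thin-driver URL; replace host/service/credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "user", "password");
             PreparedStatement ps = conn.prepareStatement(JOIN_SQL);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // Rows arrive grouped by account (thanks to ORDER BY), so each
                // account's PDF can be generated and released before the next.
                System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
            }
        }
    }
}
```

The database does the heavy lifting of the join, and the JDBC cursor streams rows, so the Java side never holds more than one row (plus the current account's buffer) in memory.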

Upvotes: 0
