Rony

Reputation: 21

Reading Big File in Java

I have a Swing application that works on a CSV file. It reads the full file line by line, computes some required statistics, and shows the output. The upper part of the output screen shows each record from the file, in order, in a JTable, while the lower part shows statistics computed from that data. The problem is that the JVM takes four times as much memory as the file size: while processing an 86 MB file, the heap uses 377 MB of space (memory utilization checked using jVisualVM).

Note:

  1. I have used LineNumberReader for reading the file (because of a specific requirement; I can change it if that helps with memory usage).

  2. Every line is read with readLine(), and then .split(",") is called on the resulting String to obtain the individual fields of that record.

  3. Each record is stored in a Vector for display in the JTable, while other statistics are stored in a HashMap and a TreeMap, and summary data in a JavaBean class. One graph is also plotted using JFreeChart.
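
A minimal sketch of the loading pattern described in the notes above (class and method names are illustrative, not from the actual application) — every parsed record stays referenced, which is why heap usage grows with the file:

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.Vector;

public class CsvLoader {
    // Reads the whole file and keeps every record in memory:
    // heap usage is proportional to file size (plus per-object overhead).
    public static Vector<String[]> loadAll(String path) throws IOException {
        Vector<String[]> records = new Vector<>();
        try (LineNumberReader reader = new LineNumberReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line.split(","));  // each String[] stays referenced
            }
        }
        return records;
    }
}
```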

Please suggest how to reduce memory utilization, as I need to process a 2 GB file.

Upvotes: 1

Views: 739

Answers (4)

Nate

Reputation: 557

Increase the JVM heap size (-Xms and -Xmx). If you have the memory, this is the best solution. If you cannot do that, you will need to find a compromise: a combination of data-model and presentation (GUI) changes, usually resulting in increased code complexity and more potential for bugs.
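
For example, the flags might look like this (the values and jar name are illustrative; tune them to your machine):

```shell
# -Xms sets the initial heap size, -Xmx the maximum heap size.
java -Xms512m -Xmx3g -jar csv-viewer.jar
```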

  1. Try modifying your statistics algorithms to do their work as the data is being read, rather than requiring that it all exist in memory. You may find that algorithms which approximate the statistics are sufficient.
  2. If your data contains many duplicate String literals, use a HashSet to create a cache. Beware: caches are notorious for being memory leaks (e.g. not clearing them before loading different files).
  3. Reduce the amount of data displayed on the graph. It is common for a graph with a lot of data to have many points displayed at or near the same pixel. Consider truncating the data by merging multiple values at or near the same position on the x-axis. If your data set contains 2,000,000 points, for example, most of them will coincide with other nearby points, so your underlying data model does not need to store everything.
  4. Beware of information overload. Will your JTable be meaningful to the user if it contains 2 GB worth of data? Perhaps you should paginate the table, reading only 1000 entries from the file at a time for display.
  5. I'm hesitant to suggest this, but during the loading process you could convert the CSV data into a file-based database (such as cdb). You could accumulate statistics and store some data for the graph during the conversion, then use the database to quickly read a page of data at a time for the JTable, as suggested above.
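
The first point above can be sketched as a single pass that updates running aggregates per line instead of retaining records (the column index and statistics chosen here are illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingStats {
    // One pass over the file: only the running aggregates are kept in
    // memory, never the records themselves.
    public static double[] countSumMinMax(String path, int column) throws IOException {
        long count = 0;
        double sum = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                double v = Double.parseDouble(line.split(",")[column]);
                count++;
                sum += v;
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }
        return new double[] { count, sum, min, max };
    }
}
```

Memory use is constant regardless of file size, since each line is discarded after it contributes to the aggregates.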

Upvotes: 0

Raffaele

Reputation: 20885

Every Java object has a memory overhead, so if your Strings are really short, that could explain why you see four times the size of your file. You also have to count the size of the Vector and its internals. I don't think a Map would improve memory usage: Java already interns String literals so that identical literals share one object, but note that Strings produced at runtime (e.g. by split()) are not automatically shared.

I think you should revise your design. Given your requirements

The upper part of the output screen shows each record from the file, in order, in a JTable, while the lower part shows statistics computed from that data

you don't need to store the whole file in memory. You need to read it entirely to compute your statistics, but this can certainly be done using a very small amount of memory. As for the JTable part, it can be accomplished in a number of ways without requiring 2 GB of heap space for your program; something is wrong whenever a program needs to keep an entire CSV file in memory. Have a look at Apache Commons IO's LineIterator.
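
A minimal sketch of that single-pass idea using Commons IO's LineIterator (assumes commons-io is on the classpath; the actual statistics update is left as a comment):

```java
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

public class SinglePass {
    // Iterates the file one line at a time; memory use stays small
    // because no line is retained after it has been processed.
    public static long process(File csv) throws IOException {
        LineIterator it = FileUtils.lineIterator(csv, "UTF-8");
        try {
            long lines = 0;
            while (it.hasNext()) {
                String line = it.nextLine();
                // update running statistics here instead of storing the line
                lines++;
            }
            return lines;
        } finally {
            LineIterator.closeQuietly(it);
        }
    }
}
```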

Upvotes: 0

Chander Shivdasani

Reputation: 10131

Try giving OpenCSV a shot. When you use the readNext() method, it only holds the last line read, which is perfect for large files.

From their website, the following are the features they support:

  • Arbitrary numbers of values per line

  • Ignoring commas in quoted elements

  • Handling quoted entries with embedded carriage returns (i.e. entries that span multiple lines)

  • Configurable separator and quote characters (or use sensible defaults)

  • Read all the entries at once, or use an Iterator style model

  • Creating CSV files from String[] (i.e. automatic escaping of embedded quote chars)
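
A sketch of the readNext() loop (assumes OpenCSV is on the classpath; the package is com.opencsv in recent versions, au.com.bytecode.opencsv in older ones, and newer readNext() signatures declare an extra checked exception, hence the broad throws clause):

```java
import java.io.FileReader;
import com.opencsv.CSVReader;

public class OpenCsvExample {
    // Only the current record is held in memory at any time.
    public static void process(String path) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader(path))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                // update statistics / page buffer here, then let the
                // record go out of scope
            }
        }
    }
}
```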

Upvotes: 1

user1354006

Reputation:

Use best practices to improve your program:

  1. Use multithreading to get better CPU utilization.
  2. Set minimum and maximum heap sizes to make better use of RAM.
  3. Use appropriate data structures and design.

Upvotes: 0
