linus jame

Reputation: 31

Max capacity of String array

I'm getting a single record that has multiple subrecords in it, separated by ","

Each subrecord is 200 characters long, and the number of subrecords can go up to 5 million.

Is it a good practice to store all records in string array? Will it cause any issue? If yes how can I perform in efficient way? Disk memory is sufficiently available.

inpString.split(",");

The source gives me a single record containing the info for all users in Active Directory.

Update

Here is a sample input string with 2 subrecords (each subrecord has fewer characters here; it's just an example). There can be up to 5M subrecords.

CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;

Output in file

batchID,groupName,ou=groupapplicationname,CN=100,uid=100,DC=AD,DC=myhost,moreinfo
batchID,groupName,ou=groupapplicationname,CN=200,uid=100,DC=AD,DC=myhost,moreinfo

Upvotes: 1

Views: 877

Answers (2)

Kitswas

Reputation: 1197

Wrote a program that creates an array of 5 million Strings and initializes them each with an array of 200 characters. (The Scanner is to pause the program while I go and take a look at the memory).

import java.util.Scanner;
public class ArrMem
{
    public static void main(String args[])
    {
        String[] s = new String[5000000];
        for(int i=0;i<5000000;i++)
        {
            s[i] = new String(new char[200]);
        }
        Scanner sc = new Scanner(System.in);
        sc.nextLine();
    }
}

And executed it. The RAM utilized is shown below.

[Image: Memory details]

Considering you won't be working with all the Strings at once, you should extract them from your file in batches (to reduce interactions with the filesystem) and process each batch in turn. Use this approach if you want to stick with your split-based method. The batch size trades execution time against memory:

Batch Size    Execution Time    Memory Used
Larger        Lower             Higher
Smaller       Higher            Lower
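
As a rough sketch of the batching idea, assuming the subrecords are terminated by ';' as in the question's sample: a Scanner with ';' as its delimiter pulls one subrecord at a time, and the records are handed to a callback in lists of at most batchSize entries. The class name BatchReader, the method processInBatches and the batch size of 1000 are made up for illustration, not taken from the question.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

public class BatchReader {
    // Reads ';'-terminated subrecords from source and passes them to handler
    // in lists of at most batchSize entries. Returns the number of records seen.
    public static long processInBatches(Reader source, int batchSize,
                                        Consumer<List<String>> handler) {
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (Scanner scanner = new Scanner(source)) {
            scanner.useDelimiter(";");
            while (scanner.hasNext()) {
                String record = scanner.next().trim();
                if (record.isEmpty()) {
                    continue; // skip blank tokens such as a trailing newline
                }
                batch.add(record);
                total++;
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    batch = new ArrayList<>(batchSize); // let the old batch be collected
                }
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // final, possibly smaller batch
        }
        return total;
    }

    public static void main(String[] args) {
        String sample = "CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;"
                + "CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;";
        long n = processInBatches(new StringReader(sample), 1000,
                batch -> System.out.println("processing " + batch.size() + " records"));
        System.out.println(n + " records in total");
    }
}
```

For a real file you would pass a FileReader (or similar) instead of the StringReader. Only one batch is referenced at a time, so peak memory scales with batchSize rather than with the total record count.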

Or

Use a BufferedReader to read the subrecords from the file one at a time.

Upvotes: 0

Stephen C

Reputation: 718708

In theory, a Java string can contain close to 2^31 characters and a Java array can contain close to 2^31 strings.

In practice (assuming Java 8, 64 bit, no compressed oops; see note 1) the space utilization of String[] and String is as follows:

  • a String[] array needs 8 bytes per entry,
  • a String needs 2 bytes per character ... plus overheads of about 40 bytes per String.

It is easy to see that a maximal array of maximally sized strings would need roughly 2^31 x 2^31 x 2 = 2^63 bytes, which is far more memory than any real machine can hold, even though it would nominally fit in a 64 bit address space. However that is just a theoretical concern ...

In your example:

My guess is that the space needed amounts to roughly 500 x 5,000,000 = 2.5GB heap space to represent the array and the strings. If you started by reading the entire record into memory as a String before splitting it, that could be as much as 7.5GB depending on how you read it. (But you can be smarter than that ...)
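
To make that arithmetic concrete, here is a back-of-the-envelope sketch using the per-entry figures listed above (8 bytes per array slot, 2 bytes per character, about 40 bytes of overhead per String). The class and method names are illustrative only, and the constants are rough estimates for Java 8 without compressed oops, not measured values.

```java
public class HeapEstimate {
    // Rough figures for Java 8, 64 bit, no compressed oops (assumptions, not measurements).
    static final long BYTES_PER_ARRAY_SLOT = 8;
    static final long BYTES_PER_CHAR = 2;
    static final long STRING_OVERHEAD = 40;

    // Estimated heap bytes for a String[] holding recordCount strings
    // of charsPerRecord characters each.
    public static long estimateBytes(long recordCount, long charsPerRecord) {
        long perString = charsPerRecord * BYTES_PER_CHAR + STRING_OVERHEAD;
        return recordCount * (BYTES_PER_ARRAY_SLOT + perString);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(5_000_000L, 200);
        System.out.printf("%d bytes (about %.2f GB)%n", bytes, bytes / 1e9);
    }
}
```

With these figures each record costs 8 + 400 + 40 = 448 bytes, so 5,000,000 records come to about 2.24GB. Rounding the per-record cost up to 500 bytes gives the 2.5GB guess.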


Is it a good practice to store all records in string array?

It depends on what you intend to do with the records. Without more information we can't say whether it is a good idea.

Note that there is no such thing as "good practice" or "best practice" in the general sense. Solutions need to be designed for purpose, and judgement about them can only be made in context.

Will it cause any issue?

As per the above, it could use a lot of heap space.

If yes how can I perform in efficient way?

We can't tell you that unless you explain clearly what you are actually going to do with the records in memory.

It also depends on what kind of efficiency you are concerned about. CPU utilization? Memory utilization? Software developer time?

Disk memory is sufficiently available.

That may or may not be relevant. It depends on what you are going to do with the records in memory.

1 - The amount of space used to represent strings is JVM dependent in a number of respects. For example, from Java 9 onwards, strings that contain only Latin-1 characters (which includes ASCII) need just 1 byte per character.


So looking at your updated question, it is clear that reading the entire file into memory and splitting it is the wrong approach.

What you need to do is to read characters until you have a complete record; i.e. until you reach a ';'. Then you split the record into fields on ','. Then you process the fields and output them. Finally you discard that record and start reading the next one.

In other words you avoid creating a huge array of 5,000,000 String in memory.
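
A minimal sketch of that loop, under some assumptions: the input arrives through a Reader, the output goes to a Writer, and the per-record transformation (toOutputLine, which just prefixes a batch id) is a hypothetical placeholder, since the question does not spell out how the output fields are derived. The names RecordStreamer, stream and "batchID" are likewise made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class RecordStreamer {
    // Hypothetical transformation: prefixes a batch id and re-joins the fields.
    static String toOutputLine(String batchId, String[] fields) {
        return batchId + "," + String.join(",", fields);
    }

    // Reads ';'-terminated records one at a time, writes one output line per
    // record, and returns the number of records written. Text after the last
    // ';' is ignored in this sketch.
    public static long stream(Reader in, Writer out, String batchId) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        StringBuilder record = new StringBuilder(256);
        long count = 0;
        int c;
        while ((c = reader.read()) != -1) {
            if (c != ';') {
                record.append((char) c); // still inside the current record
                continue;
            }
            String text = record.toString().trim();
            record.setLength(0); // discard the record before reading the next one
            if (text.isEmpty()) {
                continue;
            }
            String[] fields = text.split(",");
            writer.write(toOutputLine(batchId, fields));
            writer.newLine();
            count++;
        }
        writer.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        String sample = "CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;"
                + "CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;";
        StringWriter out = new StringWriter();
        long n = stream(new StringReader(sample), out, "batchID");
        System.out.print(out);
        System.out.println(n + " records written");
    }
}
```

At any moment only one record's worth of characters is held in the StringBuilder, so memory use stays small no matter how many subrecords the input contains.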

Upvotes: 1
