Reputation: 31
I'm receiving a single record that contains multiple sub-records separated by a comma (,).
Each sub-record is 200 characters long, and there can be up to 5 million sub-records.
Is it good practice to store all the records in a String array? Will it cause any issues? If so, how can I do this more efficiently? Sufficient disk space is available.
inpString.split(",");
The source gives me a single record containing all user information from Active Directory.
Update
Here is a sample input string with 2 sub-records (each sub-record is shorter than in reality; this is just an example). There can be up to 5 million of them.
CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;
Output in file
batchID,groupName,ou=groupapplicationname,CN=100,uid=100,DC=AD,DC=myhost,moreinfo
batchID,groupName,ou=groupapplicationname,CN=200,uid=100,DC=AD,DC=myhost,moreinfo
Upvotes: 1
Views: 877
Reputation: 1197
I wrote a program that creates an array of 5 million Strings and initializes each of them with an array of 200 characters. (The Scanner is there to pause the program while I go and look at the memory usage.)
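import java.util.Scanner;

public class ArrMem
{
    public static void main(String[] args)
    {
        // Allocate 5 million Strings of 200 characters each.
        String[] s = new String[5000000];
        for (int i = 0; i < 5000000; i++)
        {
            s[i] = new String(new char[200]);
        }
        // Block on input so the process stays alive while memory is inspected.
        Scanner sc = new Scanner(System.in);
        sc.nextLine();
    }
}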
And executed it. The RAM utilized is shown below.
Since you won't be working with all the Strings at once, you can read them from your file in batches (to reduce interactions with the filesystem) and process each batch in turn. That is the way to go if you want to stick with your array-based method. The batch size is a trade-off:
Batch Size | Execution Time | Memory Used |
---|---|---|
Larger | Lower | Higher |
Smaller | Higher | Lower |
Or use a BufferedReader to read the sub-records from the file.
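A minimal sketch of the batching idea, assuming the sub-records are separated by ; as in the sample from the question. The class name BatchReader, the method processInBatches and the handler callback are hypothetical names chosen for illustration; only one batch is held in memory at a time.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

public class BatchReader {

    // Reads ';'-separated sub-records from the source and hands them to the
    // handler in lists of at most batchSize elements. Returns the record count.
    static int processInBatches(Reader source, int batchSize, Consumer<List<String>> handler) {
        int total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (Scanner scanner = new Scanner(new BufferedReader(source))) {
            scanner.useDelimiter(";"); // assumption: ';' ends each sub-record
            while (scanner.hasNext()) {
                String record = scanner.next().trim();
                if (record.isEmpty()) {
                    continue; // skip blank tokens such as a trailing newline
                }
                batch.add(record);
                total++;
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
        }
        if (!batch.isEmpty()) {
            handler.accept(batch); // flush the final, partially filled batch
        }
        return total;
    }

    public static void main(String[] args) {
        String sample = "CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;"
                + "CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;";
        int count = processInBatches(new StringReader(sample), 1000,
                batch -> System.out.println("Processing " + batch.size() + " records"));
        System.out.println("Total records: " + count);
    }
}
```

For real input, the StringReader would be replaced by a FileReader (or another Reader) over the source, and the batch size tuned according to the trade-off in the table above.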
Upvotes: 0
Reputation: 718708
In theory, a Java string can contain close to 2^31 characters and a Java array can contain close to 2^31 strings.
In practice (assuming Java 8, 64-bit, no compressed oops; see footnote 1), the space utilization of String[] and String is as follows:
- A String[] array needs 8 bytes per entry.
- A String needs 2 bytes per character, plus overheads of about 40 bytes per String.

It is easy to see that a maximal array of maximally sized strings would take more memory than is addressable with a 64-bit address, even assuming you could build a machine capable of holding that much memory. However, that is just a theoretical concern ...
In your example:
My guess is that the space needed amounts to roughly 500 x 5,000,000 = 2.5GB of heap space to represent the array and the strings. If you started by reading the entire record into memory as a String before splitting it, that could be as much as 7.5GB depending on how you read it. (But you can be smarter than that ...)
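The estimate can be worked through with the per-entry figures quoted above. This is only a back-of-the-envelope sketch: the class and constant names are made up for illustration, and the real numbers depend on the JVM and its settings.

```java
public class HeapEstimate {

    // Approximate costs quoted above for Java 8, 64-bit, no compressed oops.
    static final long ARRAY_SLOT_BYTES = 8;       // one reference per array entry
    static final long STRING_OVERHEAD_BYTES = 40; // object headers and fields
    static final long BYTES_PER_CHAR = 2;         // UTF-16 code unit

    // Estimated heap bytes for 'count' strings of 'charsPerString' characters.
    static long estimateBytes(long count, long charsPerString) {
        long perString = ARRAY_SLOT_BYTES + STRING_OVERHEAD_BYTES
                + BYTES_PER_CHAR * charsPerString;
        return count * perString;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(5_000_000L, 200);
        System.out.println(bytes + " bytes"); // 448 bytes per string x 5,000,000
    }
}
```

With 200-character strings this comes to 448 bytes per sub-record, or about 2.24GB in total, which is where the rounded figure of 500 bytes and 2.5GB comes from.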
> Is it a good practice to store all records in string array ?
It depends on what you intend to do with the records. Without more information we can't say whether it is a good idea.
Note that there is no such thing as "good practice" or "best practice" in the general sense. Solutions need to be designed for purpose, and judgement about them can only be made in context.
> Will it cause any issue ?
As per the above, it could use a lot of heap space.
> If yes how can I perform in efficient way?
We can't tell you that unless you explain clearly what you are actually going to do with the records in memory.
It also depends on what kind of efficiency you are concerned about. CPU utilization? Memory utilization? Software developer time?
> Disk memory is sufficiently available.
That may or may not be relevant. It depends on what you are going to do with the records in memory.
1 - The amount of space used to represent strings is JVM dependent in a number of respects. For example, from Java 9 onwards, strings that consist only of Latin-1 characters (which includes ASCII) need 1 byte per character.
So looking at your updated question, it is clear that reading the entire file into memory and splitting it is the wrong approach.
What you need to do is to read characters until you get a record; i.e. until you get a ; character. Then you split the record into fields based on the , character. Then you process the fields and output them. Finally you discard that record and start reading the next one.
In other words, you avoid creating a huge array of 5,000,000 Strings in memory.
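The steps above could be sketched roughly as follows. The names RecordStreamer, stream and emit are hypothetical, and the output format is only a placeholder (a literal batchID prefix followed by the original fields), since the real mapping from input fields to output columns depends on your requirements.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class RecordStreamer {

    // Reads characters until ';', processes that record, then discards it.
    // Only one record is held in memory at a time. Returns the record count.
    static int stream(Reader in, Writer out) {
        StringBuilder record = new StringBuilder(256);
        int count = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == ';') {
                    count += emit(record, out);
                    record.setLength(0); // discard the record just processed
                } else {
                    record.append((char) c);
                }
            }
            count += emit(record, out); // last record if there is no trailing ';'
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Splits one record into fields and writes one output line for it.
    private static int emit(StringBuilder record, Writer out) throws IOException {
        String text = record.toString().trim();
        if (text.isEmpty()) {
            return 0;
        }
        String[] fields = text.split(",");
        // Placeholder transformation: the real field mapping is an assumption.
        out.write("batchID," + String.join(",", fields) + "\n");
        return 1;
    }

    public static void main(String[] args) {
        String sample = "CN=100,OU=Employee,OU=groups,DC=AD,DC=myhost;"
                + "CN=200,OU=Employee,OU=groups,DC=AD,DC=myhost;";
        StringWriter out = new StringWriter();
        int count = stream(new StringReader(sample), out);
        System.out.print(out);
        System.out.println(count + " records written");
    }
}
```

For a large input, wrapping the source in a BufferedReader and the destination in a BufferedWriter keeps the character-at-a-time reads and per-record writes cheap.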
Upvotes: 1