saurav

Reputation: 3462

Efficient Data Structure To Store Millions of Records

I have an input file that contains millions of records, and each record in turn contains thousands of columns separated by a delimiter.

The number of records and columns can vary from file to file.

I need to parse these records and store them in a Java object so that they can be passed on to the Drools framework for column-level validation.

This is what my input data and schema file look like.

Input File :

John|Doe|35|10 Floyd St|132|Los Angeles|CA|USA ... and so on 
...
...
Millions of records like this

Schema File :

firstName|String|false|20|NA
lastName|String|false|20|NA
age|Integer|false|3|NA
addressLine1|String|false|20|NA
addressLine2|String|false|20|NA
city|String|false|5|NA
state|String|false|10|NA
country|String|false|10|NA

I tried to implement a solution with the help of a map and created a Java class containing this map.

class GenericRecord {
   Map<String,FieldSpecification> properties; //used HashMap as an implementation
}

class FieldSpecification {
    public String fieldName;
    public String dataType;
    public int length;
    public String value;
    public String format;
}

For each row in the input file I am creating a GenericRecord object and using the map to store the values of its columns. In addition, I am storing the metadata about each column (dataType, length, format, etc.) in a FieldSpecification object.
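
A simplified sketch of this per-row conversion (toRecord and schemaSpecs are illustrative names; schemaSpecs stands for the column metadata parsed from the schema file, and the delimiter handling is simplified):

// One GenericRecord per input line, one map entry (with its own
// FieldSpecification) per column.
GenericRecord toRecord(String line, List<FieldSpecification> schemaSpecs) {
    String[] columns = line.split("\\|", -1); // -1 keeps trailing empty columns
    GenericRecord record = new GenericRecord();
    record.properties = new HashMap<>(columns.length);
    for (int i = 0; i < columns.length && i < schemaSpecs.size(); i++) {
        FieldSpecification spec = schemaSpecs.get(i);
        FieldSpecification column = new FieldSpecification();
        column.fieldName = spec.fieldName;
        column.dataType  = spec.dataType;
        column.length    = spec.length;
        column.format    = spec.format;
        column.value     = columns[i]; // the cell value for this row
        record.properties.put(column.fieldName, column);
    }
    return record;
}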

For a few thousand rows in my input file it worked fine, but as soon as the number of rows increased, it blew up with memory issues (as expected), since it creates millions of map objects, each holding thousands of keys.
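
To put rough numbers on it (purely illustrative): with, say, 2 million rows and 2,000 columns, that is 4 billion map entries; at a few tens of bytes of per-entry overhead each, before even counting the key strings, the value strings, and the FieldSpecification objects themselves, this already exceeds 100 GB of heap.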

I am aware that this is not an efficient way to solve this type of problem.

So my question is: will an in-memory solution work in my scenario, or should I prefer a disk-based solution such as an embedded DB or a disk-backed map?

Please also advise if there is any other open-source Map implementation that I could use.

Note: For file parsing and data validation I am using Hadoop, running on a 40-node cluster.

Here is the flow and implementation of my mapper:

The mapper receives the value as one complete row; this row is then passed to a Java framework that converts it into the corresponding GenericRecord (as mentioned above), and that object is passed to the Drools framework for further validation.

Mapper Implementation :

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // The Text value carries one complete input line
        String record = value.toString();

        // Drools service that takes the record as input and validates it
        // against the Excel sheet provided
        workingMemory = knowledgeBase.newStatefulKnowledgeSession();
        DroolsObject recordObject = DroolsServiceImpl.validateByRecord(record, fileMetaData, workingMemory);

        // Route the record depending on whether it passed validation
        if (recordObject.isValid) {
            context.getCounter(AppCounter.VALID_RECORD).increment(1);
            mapperOutputKey.set("A");
            mapperOutputValue.set(recordObject.toString());
            context.write(mapperOutputKey, mapperOutputValue);
        }
        else {
            context.getCounter(AppCounter.INVALID_RECORD).increment(1);
            mapperOutputKey.set("R");
            mapperOutputValue.set(recordObject.toStringWithErrors());
            context.write(mapperOutputKey, mapperOutputValue);
        }
}

Upvotes: 0

Views: 6097

Answers (2)

Jared

Reputation: 1897

Since you have to save every byte of data that is in the file in memory (except possibly the delimiters), start by looking at the size of the file and comparing it to the size of memory. If your file is bigger than memory, scratch the whole idea of keeping it in memory.

If memory is larger than the file, you have a chance, though you need to carefully examine how this file might grow in the future, what platforms the program will run on, etc.

So ASSUMING THAT IT FITS, you can be more efficient with your data structure. One simple way to save memory would be to scrap the maps and just save each record as a string (as encoded in the file). An array of strings should have minimal overhead, though you'll want to make sure you're not constantly resizing the original array as you populate it.
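
A minimal sketch of that idea (the loader class, the file path, and the expectedRows estimate are placeholders, not something from the question):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch: keep each record as a single raw String; pre-size the list so the
// backing array is not repeatedly resized while loading.
class RawRecordLoader {
    static List<String> load(String path, int expectedRows) throws IOException {
        List<String> records = new ArrayList<>(expectedRows);
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line); // store the delimited record as-is
            }
        }
        return records;
    }
}

Individual columns are then only split out at the moment a record is actually validated, so nothing but the raw line is held per record.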

Keeping your data structures simple when they get large can save you a lot of memory on overhead.

Also, even if the data easily fits into memory, you may need to adjust the JVM so that it is allocated enough memory (increase the heap size using -Xmx). I hope you're using a 64-bit JVM on a 64-bit platform.

Upvotes: 1

choeger

Reputation: 3567

I would propose keeping the data in a single byte[][] table and referring to rows by their number. You can then use a cursor that reads the corresponding fields on demand:

class FieldSpecification {
    private final int row;
    private final byte[][] mem;

    public String fieldName();
    public String dataType();
    public int length();
    public String value();
    public String format();
}

The garbage collector should dispose of these objects quite easily. You only need to care about their lifecycle.

When the byte-array does not fit into your memory, well, then you are screwed anyway.

You can then implement your map by mapping names to row numbers.
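
A minimal sketch of such an on-demand cursor (the UTF-8 decoding, the pipe delimiter, and the name-to-column-index map are assumptions for illustration, not part of the proposal above):

import java.nio.charset.StandardCharsets;
import java.util.Map;

// All rows live in one byte[][] (one byte[] per raw line); the cursor decodes
// a single field only when asked, so no per-column objects stay alive.
class RowCursor {
    private final int row;
    private final byte[][] mem;
    private final Map<String, Integer> columnIndexByName; // built once from the schema file

    RowCursor(int row, byte[][] mem, Map<String, Integer> columnIndexByName) {
        this.row = row;
        this.mem = mem;
        this.columnIndexByName = columnIndexByName;
    }

    // Decode one column of this row on demand by splitting the raw line.
    String value(String fieldName) {
        String line = new String(mem[row], StandardCharsets.UTF_8);
        String[] columns = line.split("\\|", -1);
        return columns[columnIndexByName.get(fieldName)];
    }
}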

Upvotes: 0
