Vikas Saxena

Reputation: 1183

Storing multiple Strings in the Value field of a Map

In one of my banking projects I have a RecordFile which contains records in the format:

CustomerNumber,AccountNumber,FirstName,LastName, some other fields...

In the transactional records, which are stored in a separate file altogether, either CustomerNumber or AccountNumber is populated, or (rarely) both.

The purpose of the MapReduce job is to enrich the transactional data using the RecordFile.

There are two inputs to the job:

1) A directory with a file containing transactional records. Records are of the format SourceAccountNumber, SrcCustomerNumber, DestinationAccountNumber, DestinationCustomerNbr, AmountTransferred (some other fields)

The issue is that in some cases not all the fields are populated, and the missing ones have to be filled in using the RecordFile. A sample record is:

1001,,1005,5005,75,...

In this record the SrcCustomerNumber, i.e. the customer initiating the transaction, is not populated.

,5003,1002,,49,.....

In this record, the SourceAccountNumber and DestinationCustomerNbr are missing.

2) RecordFile. This file contains customer details such as customer number, account number, first name, last name, SSN, etc.

The format is

CustomerNumber,AccountNumber,FirstName,LastName, some other fields...

E.g.:

1001,5001,John,Nash,....
1002,5002,Kevin,Peterson,..
1003,5003,Sue-Ann,Lim,....
1004,5004,Michael,Chong,...
1005,5005,Phillip,Anderson,....

The final output should have the format

SourceAccountNumber, SrcCustomerNumber, SourceCustomerFirstName, SourceCustomerLastName, DestinationAccountNumber, DestinationCustomerNbr, DestCustomerFirstName, DestCustomerLastName, AmountTransferred

E.g.:

1001,5001,John,Nash,1005,5005,Phillip,Anderson,.....

1003,5003,Sue-Ann,Lim,1002,5002,Kevin,Peterson,....

My question is: if I have to add the FirstName and LastName fields during enrichment using the RecordFile, how should I split the RecordFile into maps?

1) Two different maps: Map1 (CustomerNbr as key, first name as value) and Map2 (CustomerNbr as key, last name as value)

2) One single map: mapSingle (CustomerNbr as key, and as value an object of a user-defined class which has both first name and last name as fields)

Which of them will be faster, considering that the RecordFile has 10 million+ records, the transaction data is almost 10 GB for every 15-minute window, and this job runs every 15 minutes to enrich the data?

Upvotes: 0

Views: 56

Answers (1)

techPackets

Reputation: 4516

The second version is more efficient: you look up the key in the map only once, whereas in the first version you look it up twice, computing the key's hash code and searching the hash buckets two times.
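To make the difference concrete, here is a minimal sketch of both options side by side. The class name `CustomerLookup`, the value class `CustomerName`, and the two helper methods are hypothetical names chosen for illustration, and the sketch assumes the maps are plain `HashMap`s keyed by the customer number as a `String`; it is not taken from the asker's job.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomerLookup {

    /** Hypothetical value class holding the per-customer fields needed for enrichment. */
    static final class CustomerName {
        final String firstName;
        final String lastName;

        CustomerName(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** Option 1: two maps, so every customer costs two hash lookups. */
    static String namesFromTwoMaps(Map<String, String> firstNames,
                                   Map<String, String> lastNames,
                                   String customerNbr) {
        String first = firstNames.get(customerNbr); // lookup 1
        String last = lastNames.get(customerNbr);   // lookup 2
        if (first == null || last == null) {
            return null; // customer unknown in at least one map
        }
        return first + "," + last;
    }

    /** Option 2: one map with an object value, so every customer costs one hash lookup. */
    static String namesFromSingleMap(Map<String, CustomerName> customers,
                                     String customerNbr) {
        CustomerName name = customers.get(customerNbr); // the only lookup
        return name == null ? null : name.firstName + "," + name.lastName;
    }

    public static void main(String[] args) {
        Map<String, CustomerName> customers = new HashMap<>();
        customers.put("5001", new CustomerName("John", "Nash"));
        System.out.println(namesFromSingleMap(customers, "5001")); // prints John,Nash
    }
}
```

With the single map there is also only one set of keys and map entries held in memory for the 10 million+ customers instead of two.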

It's also a more flexible approach: if in future you want to add more customer fields, you can add them to the value class. Otherwise you would have to create a new map for each new field.

You can also measure the performance of both versions using JMH. JMH is a Java harness for building, running, and analysing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.

Upvotes: 1
