Reputation: 6111
I have one file
base.txt
5071111111
5071111112
5071111113
5071111114
..... around 15 lakh (1.5 million) numbers
and another file
status.txt
5071111112,sended
5071111113,failed
.....
The actual scenario is: I have a base file containing mobile numbers for sending messages, and status.txt containing the delivery status for each number.
Now my task is to merge the two files into one, like
merged.txt
5071111111
5071111112,sended
5071111113,failed
5071111114
....... and so on
I tried the usual solution: take each number from base.txt and look it up in status.txt (e.g. "5071111112,sended"); if the number is not found there, copy just the number to merged.txt, and if it is found, copy the updated status line for that number to merged.txt.
Now merged.txt will act as the base file for me.
Also, status.txt files come in at regular intervals, so the process of comparing, creating a new merged.txt, deleting the previous one, and renaming the new one goes on and on.
I also tried the RandomAccessFile class, but I am facing a data truncation issue with it, similar to the problem described here: link
I read a few answers posted on Stack Overflow, but many suggest the approach I mentioned above. Is there any other solution to this?
Upvotes: 0
Views: 186
Reputation: 6111
I thought about it and implemented one solution with the help of your posts, and I am getting the desired result. I just want to confirm whether it is a good solution or not.
In the first step, I sort my base.txt file.
In the second step, I split base.txt, which contains around 10,00,000 numbers, into many files of 1,00,000 numbers each. (I split the file keeping in mind that loading all 10,00,000 numbers into memory in a HashMap or similar could cause an out-of-memory error.)
After the base file is split into chunks, I maintain one index file which keeps track of the range of numbers present in each split file:
limit file-name
1-1,00,000 split0.txt
1,00,001-2,00,000 split1.txt
Now I start reading status.txt and pick one number to merge; with the help of the index file I know exactly which split file I have to open for the update.
Since a chunk file contains only around 1,00,000 numbers (say split4.txt), I load it into a HashMap, update the correct record, and write the HashMap back to that file.
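The per-chunk update step described above can be sketched roughly like this (a minimal sketch, assuming each split file holds one line per number, either a bare number or "number,status"; the class and method names are my own, not from your code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkUpdate {
    // Load one split file into a map keyed by phone number; the value is the full line.
    static Map<String, String> loadChunk(Path chunk) throws IOException {
        Map<String, String> map = new LinkedHashMap<>(); // keeps the sorted file order
        for (String line : Files.readAllLines(chunk)) {
            if (line.isEmpty()) continue;
            String number = line.split(",", 2)[0];
            map.put(number, line);
        }
        return map;
    }

    // Apply one status line (e.g. "5071111112,sended") and rewrite the chunk file.
    static void applyStatus(Path chunk, String statusLine) throws IOException {
        String number = statusLine.split(",", 2)[0];
        Map<String, String> map = loadChunk(chunk);
        if (map.containsKey(number)) {
            map.put(number, statusLine);      // replace the bare number with the status line
            Files.write(chunk, map.values()); // write the whole chunk back
        }
    }
}
```

Note that rewriting the whole chunk per status line is O(chunk size) per update; batching all status lines that target the same chunk before writing it back would cut the I/O considerably.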
I am getting the desired result with this solution; I just want to confirm whether this is the right approach or whether I am missing anything.
Thanks
Upvotes: 0
Reputation: 5648
There are a couple of ways to approach this problem, and they aren't Java specific (which is what people are alluding to). These are CS questions.
What you need to do is find the intersection of set 'A' with set 'B'. In Java, two ready-made classes can do this (HashSet and TreeSet); both are backed by their equivalent Map types.
There are a few ways you can approach this problem:
1) Sort the files in chunks, as in a binary search tree (for any sorted tree, the subtrees are also sorted). Here you would create sorted sub-runs using whatever memory you think you can spare for the smaller sorts (generally, the memory budget will be some fraction of the number of entries in the file). You can write intermediate sort results to a temporary file.
2) Use a Bloom filter to dramatically reduce the number of considered elements. Create a Bloom filter of the superset (which in your case would be the file without the status codes). Then use the filter to remove elements that are DEFINITELY not in the other set.
If you don't have a clear superset, you can apply cross-filtering: create a set of Bloom bits for set 'A' and remove from 'B' anything that is definitely not contained in 'A', then reverse the process.
What you end up with are two dramatically smaller sets that 'probably' intersect. At that point you can probably just use setA.retainAll(setB) to produce the common elements.
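For the final step, the retainAll call looks like this (a minimal sketch with plain HashSet; the method and variable names are illustrative, not from any existing code):

```java
import java.util.HashSet;
import java.util.Set;

public class Intersect {
    // Common elements of two sets, e.g. numbers from base.txt vs. numbers
    // parsed out of status.txt. Copies 'base' so neither input is modified.
    static Set<String> intersection(Set<String> base, Set<String> withStatus) {
        Set<String> common = new HashSet<>(base);
        common.retainAll(withStatus); // keep only elements present in both sets
        return common;
    }
}
```

retainAll mutates the receiver in place, which is why the sketch copies the first set before calling it.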
If your sets are still unwieldy, you can use #2 before applying #1 or #3 below.
3) Set up a map-reduce job with Cassandra and some virts. You could set up some EC2 instances or use in-house virts. Your job would get done a lot faster.
Upvotes: 1
Reputation: 14964
Assuming your files are, or can be, sorted, merging them using two cursors as you described is the best solution.
You may also consider using a database.
Upvotes: 0
Reputation: 22084
If the files are not extremely big, you can read one file and put the numbers in a map:
Map<String /* phone number */, String /* status */>
Then you read through the second file line by line and put each status in the map.
When done, you iterate through the map and write it to the merged file:
for (Entry<String, String> e : map.entrySet())
    write(e.getValue());
However, this is only easy if you can load everything into memory, so it depends on how big these files really are. If we are talking about gigabytes, it might not work.
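Put together, the in-memory variant could look like this (a minimal sketch; the class name and file paths are placeholders, and it assumes every status line starts with a number that also appears in base.txt's format "number,status"):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class MapMerge {
    static void merge(Path baseFile, Path statusFile, Path mergedFile) throws IOException {
        // Phone number -> output line; LinkedHashMap preserves base.txt order.
        Map<String, String> map = new LinkedHashMap<>();
        for (String number : Files.readAllLines(baseFile)) {
            if (!number.isEmpty()) map.put(number, number);
        }
        // Overwrite entries that have a status (e.g. "5071111112,sended").
        for (String line : Files.readAllLines(statusFile)) {
            if (line.isEmpty()) continue;
            String number = line.split(",", 2)[0];
            if (map.containsKey(number)) map.put(number, line);
        }
        Files.write(mergedFile, map.values());
    }
}
```
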
If it is an option to install e.g. Cygwin so you can make use of Unix shell commands, I would do it like this (or do the same if you can otherwise sort them together into a single file):
sort -u base status > temporary
This way you are guaranteed to have all lines for a number right after each other. Then write a small Java program that reads each line: keep the number in memory, and when more status messages come, add them. When the next number is not the same as before, write the record to the merged file, which will be your final result.
Upvotes: 0
Reputation: 4674
I would build two input streams, read the first line of base.txt and of status.txt, and compare them.
Loop:
- If the numbers are equal (take a substring of the status.txt line and compare it with the base.txt line), write the status.txt line (it carries the status) and release both lines.
- If they are not equal, write the one with the lower number and release it.
Read the next line(s).
This will only work if they are ordered by the numbers (otherwise you should sort them first).
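The loop above can be sketched like this (a minimal sketch with hypothetical names; it assumes both inputs are sorted and all numbers have the same digit count, so plain string comparison orders them correctly, and it writes the status line on a match to produce the merged.txt the question asks for):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

public class TwoCursorMerge {
    // Merge two sorted files; status lines look like "number,status".
    static void merge(BufferedReader base, BufferedReader status, PrintWriter out)
            throws IOException {
        String b = base.readLine(), s = status.readLine();
        while (b != null && s != null) {
            String sNum = s.split(",", 2)[0];   // the "substring" compare described above
            int cmp = b.compareTo(sNum);        // valid because numbers are equal length
            if (cmp == 0) {                     // equal: the status line wins
                out.println(s);
                b = base.readLine();
                s = status.readLine();
            } else if (cmp < 0) {               // base number is lower: write it
                out.println(b);
                b = base.readLine();
            } else {                            // status number is lower: write it
                out.println(s);
                s = status.readLine();
            }
        }
        // Drain whichever stream still has lines.
        while (b != null) { out.println(b); b = base.readLine(); }
        while (s != null) { out.println(s); s = status.readLine(); }
    }
}
```

This streams both files, so memory use stays constant no matter how many numbers there are.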
If the run time is no problem you could easily implement bubble sort and do it line by line ;)
Upvotes: 0