Reputation: 1508
I am trying to implement compression of files using Huffman encoding. Currently, I am writing the header as the first line of the compressed file and then writing the encoded binary strings (i.e. strings having the binary encoded value).
However, instead of reducing the file size, my file size is increasing as for every character like 'a', I am writing its corresponding binary, for example 01010001 which takes more space.
How can I write it into the file in a way that it reduces the space?
This is my code
public void write( String aWord ) {
counter++;
String content;
byte[] contentInBytes;
//Write header before writing file contents
if ( counter == 1 )
{
//content gets the header in String format from the tree
content = myTree.myHeader;
contentInBytes = content.getBytes();
try {
fileOutputStream.write(contentInBytes);
fileOutputStream.write(System.getProperty("line.separator").getBytes());
} catch (IOException e) {
System.err.println(e);
}
}
//content gets the encoded binary in String format from the tree
content = myTree.writeMe(aWord);
contentInBytes = content.getBytes();
try {
fileOutputStream.write(contentInBytes);
fileOutputStream.write(System.getProperty("line.separator").getBytes());
} catch (IOException e) {
System.err.println(e);
}
}
Sample input file:
abc
aef
aeg
Compressed file:
{'g':"010",'f':"011",'c':"000",'b':"001",'e':"10",'a':"11"}
11001000
1110011
1110010
Upvotes: 1
Views: 3744
Reputation: 20059
As I gathered from the comments, you are writing text, but what you really want to achive is writing binary data. What you currently have is a nice demo for huffman encoding, but impractical for actually compressing data.
To achieve compression, you will need to output the huffman symbols as binary data, where you currently output the string "11" for an 'a', you will need to just output two bits 11.
I presume this is currently coded in myTree.writeMe(), you need to modify the method to not return a String, but something more suited to binary output, e.g. byte[].
It depends a bit on the inner workings of you tree class how to do this. I presume you are using some StringBuilder internally and simply add the encoded symbol strings while looping over the input. Instead of a StringBuilder you will need a container capable of dealing with single bits. The only suitable class that comes to min immediately is java.util.BitSet (in practice one often will write a specialized class for this, with a specialized API to do this quickly). But for simplicity lets use BitSet for now.
In method writeMe, you will in principle do the following:
BitSet buffer = new BitSet();
int bitIndex = 0;
loop over input symbols {
huff_code = getCodeForSymbol(symbol)
foreach bit in huff_code {
buffer.put(bitIndex++, bit)
}
}
return buffer.toByteArray();
How to do this efficiently depends on how you internally defined the huffman code table. But prinicple is simple, loop over the code, determine if each place is a one or a zero and put them into the BitSet at consecutive indices.
if (digits == '1') {
buffer.set(bitIndex);
} else {
buffer.clear(bitIndex);
}
You now have your huffman encoded data. But the resulting data will be impossible to decompress properly, since you are currently processing words and you do not write any indication where the compressed data actually ends (you do this currently with a line feed). If you encoded for example 3 times an 'a', the BitSet would contain 11 11 11. Thats 6 bits, but when you convert to byte[] its gets padded to 8 bits: 0b11_11_11_00.
Those extra, unavoidable bits will confuse your decompression. You will need to handle this somehow, either by encoding first the number of symbols in the data, or by using an explicit symbol signaling end of data.
This should get you an idea how to continue. Many details depend on how you implement your tree class and the encoded symbols.
Upvotes: 6