Reputation: 77
I wrote a file duplication processor which gets the MD5 hash of each file, adds it to a hashmap, then takes all of the files with the same hash and adds them to a hashmap called dupeList. But when scanning large directories such as C:\Program Files\ it throws the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Unknown Source)
at java.nio.file.Files.readAllBytes(Unknown Source)
at com.embah.FileDupe.Utils.FileUtils.getMD5Hash(FileUtils.java:14)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:43)
at com.embah.FileDupe.FileDupe.getDuplicateFiles(FileDupe.java:68)
at ImgHandler.main(ImgHandler.java:14)
I'm sure it's due to the fact that it handles so many files, but I'm not sure of a better way to handle it. I'm trying to get this working so I can sift through all my kids' baby pictures and remove duplicates before I put them on my external hard drive for long-term storage. Thanks everyone for the help!
My code:
public class FileUtils {
    public static String getMD5Hash(String path) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(path)); //LINE STACK THROWS ERROR
            byte[] hash = MessageDigest.getInstance("MD5").digest(bytes);
            bytes = null;
            String hexHash = DatatypeConverter.printHexBinary(hash);
            hash = null;
            return hexHash;
        } catch(Exception e) {
            System.out.println("Having problem with file: " + path);
            return null;
        }
    }
}
public class FileDupe {
    public static Map<String, List<String>> getDuplicateFiles(String dirs) {
        Map<String, List<String>> allEntrys = new HashMap<>(); //<hash, file loc>
        Map<String, List<String>> dupeEntrys = new HashMap<>();
        File fileDir = new File(dirs);
        if(fileDir.isDirectory()) {
            ArrayList<File> nestedFiles = getNestedFiles(fileDir.listFiles());
            File[] fileList = new File[nestedFiles.size()];
            fileList = nestedFiles.toArray(fileList);
            for(File file : fileList) {
                String path = file.getAbsolutePath();
                String hash = "";
                if((hash = FileUtils.getMD5Hash(path)) == null)
                    continue;
                if(!allEntrys.containsValue(path))
                    put(allEntrys, hash, path);
            }
            fileList = null;
        }
        allEntrys.forEach((hash, locs) -> {
            if(locs.size() > 1) {
                dupeEntrys.put(hash, locs);
            }
        });
        allEntrys = null;
        return dupeEntrys;
    }

    public static Map<String, List<String>> getDuplicateFiles(String... dirs) {
        ArrayList<Map<String, List<String>>> maps = new ArrayList<Map<String, List<String>>>();
        Map<String, List<String>> dupeMap = new HashMap<>();
        for(String dir : dirs) { //Get all dupe files
            maps.add(getDuplicateFiles(dir));
        }
        for(Map<String, List<String>> map : maps) { //iterate thru each map, and add all items not in the dupemap to it
            dupeMap.putAll(map);
        }
        return dupeMap;
    }

    protected static ArrayList<File> getNestedFiles(File[] fileDir) {
        ArrayList<File> files = new ArrayList<File>();
        return getNestedFiles(fileDir, files);
    }

    protected static ArrayList<File> getNestedFiles(File[] fileDir, ArrayList<File> allFiles) {
        for(File file : fileDir) {
            if(file.isDirectory()) {
                getNestedFiles(file.listFiles(), allFiles);
            } else {
                allFiles.add(file);
            }
        }
        return allFiles;
    }

    protected static <KEY, VALUE> void put(Map<KEY, List<VALUE>> map, KEY key, VALUE value) {
        map.compute(key, (s, strings) -> strings == null ? new ArrayList<>() : strings).add(value);
    }
}
public class ImgHandler {
    private static Scanner s = new Scanner(System.in);

    public static void main(String[] args) {
        System.out.print("Please enter locations to scan for dupelicates\nSeperate Location via semi-colon(;)\nLocations: ");
        String[] locList = s.nextLine().split(";");
        Map<String, List<String>> dupes = FileDupe.getDuplicateFiles(locList);
        System.out.println(dupes.size() + " dupes detected!");
        dupes.forEach((hash, locs) -> {
            System.out.println("Hash: " + hash);
            locs.forEach((loc) -> System.out.println("\tLocation: " + loc));
        });
    }
}
Upvotes: 0
Views: 2099
Reputation: 135
I had this Java heap space error on my Windows machine and I spent weeks searching online for a solution. I tried increasing my -Xmx value, but with no success. I even tried running my Spring Boot app with a parameter to increase the heap size at run time, with a command like the one below:
mvn spring-boot:run -Dspring-boot.run.jvmArguments="-Xms2048m -Xmx4096m"
Still no success, until I figured out I was running a 32-bit JDK, which has a limited maximum heap size. I had to uninstall the 32-bit JDK and install the 64-bit one, which solved the issue for me. I hope this helps someone with a similar issue.
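If you want to verify which JVM you are running before reinstalling, a minimal sketch like the one below can help; note that sun.arch.data.model is a HotSpot-specific property (an assumption here), so os.arch is printed as a fallback:

public class JvmCheck {
    public static void main(String[] args) {
        // "sun.arch.data.model" is HotSpot-specific (assumption); may be absent on other JVMs
        System.out.println("Data model: " + System.getProperty("sun.arch.data.model", "unknown"));
        System.out.println("os.arch:    " + System.getProperty("os.arch"));
        // Maximum heap the running JVM will attempt to use
        System.out.println("Max heap:   " + (Runtime.getRuntime().maxMemory() / (1024 * 1024)) + " MiB");
    }
}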
Upvotes: 0
Reputation: 298103
Reading the entire file into a byte array not only requires sufficient heap space; it's also limited to file sizes up to Integer.MAX_VALUE in principle (the practical limit for the HotSpot JVM is even a few bytes smaller).
The best solution is not to load the data into the heap memory at all:
public static String getMD5Hash(String path) {
    MessageDigest md;
    try { md = MessageDigest.getInstance("MD5"); }
    catch(NoSuchAlgorithmException ex) {
        System.out.println("FileUtils.getMD5Hash(): "+ex);
        return null;// TODO better error handling
    }
    try(FileChannel fch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        for(long pos = 0, rem = fch.size(), chunk; rem>pos; pos+=chunk) {
            chunk = Math.min(Integer.MAX_VALUE, rem-pos);
            md.update(fch.map(FileChannel.MapMode.READ_ONLY, pos, chunk));
        }
    } catch(IOException e) {
        System.out.println("Having problem with file: " + path);
        return null;// TODO better error handling
    }
    return String.format("%032X", new BigInteger(1, md.digest()));
}
If the underlying MessageDigest implementation is a pure Java implementation, it will transfer data from the direct buffer to the heap, but that's outside your responsibility then (and it will be a reasonable trade-off between consumed heap memory and performance).
The method above will handle files beyond the 2GiB size without problems.
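For reference, a minimal sketch of how it slots into the question's grouping logic, assuming nestedFiles is the list produced by the question's getNestedFiles(...); computeIfAbsent here is just a compact equivalent of the question's put helper:

// Hedged sketch: the chunked hash keeps the original name and signature,
// so the rest of the question's code can stay the same.
Map<String, List<String>> byHash = new HashMap<>();
for(File file : nestedFiles) {  // nestedFiles as produced by getNestedFiles(...)
    String hash = FileUtils.getMD5Hash(file.getAbsolutePath());
    if(hash != null) {
        byHash.computeIfAbsent(hash, k -> new ArrayList<>()).add(file.getAbsolutePath());
    }
}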
Upvotes: 2
Reputation: 38511
Consider using Guava:
private final static HashFunction HASH_FUNCTION = Hashing.goodFastHash(32);
//somewhere later
final HashCode hash = Files.asByteSource(file).hash(HASH_FUNCTION);
Guava will buffer the reading of the file for you.
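A self-contained sketch, assuming Guava is on the classpath; note that goodFastHash(32) is documented as not being stable across JVM runs, so a fixed algorithm such as Hashing.sha256() is used here instead, since duplicate detection may be repeated later:

import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;

public class GuavaHashExample {
    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        // Guava reads the file in buffered chunks internally, so the whole file never sits in memory
        HashCode hash = Files.asByteSource(file).hash(Hashing.sha256());
        System.out.println(file + " -> " + hash);
    }
}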
Upvotes: 0
Reputation: 1789
You have several options:

Don't read all the bytes at once; use a BufferedInputStream and read a chunk of bytes on each iteration, rather than the whole file:
try (BufferedInputStream fileInputStream = new BufferedInputStream(
        Files.newInputStream(Paths.get("your_file_here"), StandardOpenOption.READ))) {
    byte[] buf = new byte[2048];
    int len;
    while ((len = fileInputStream.read(buf)) != -1) {
        // Add only the 'len' bytes actually read to your calculation;
        // read() may return fewer than 2048 bytes even before the end of the file
        doSomethingWithBytes(buf, len);
    }
} catch (IOException ex) {
    ex.printStackTrace();
}
Use C/C++ for such a thing (though this is unsafe, because you will be handling the memory yourself).
Upvotes: 0
Reputation: 116472
Whatever implementation FileUtils has, it is trying to read in whole files to calculate the hash. This is not necessary: the calculation can be done by reading the content in smaller chunks. In fact it is bad design to require whole-file reads, instead of simply reading in the chunks that are needed (64 bytes?). So maybe you need to use a better library.
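For illustration, a minimal sketch of chunked hashing with just the JDK's DigestInputStream; the class name ChunkedMd5 and the 8 KiB buffer size are arbitrary choices for the example:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChunkedMd5 {
    public static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            // Reading through the DigestInputStream feeds the digest one buffer at a time,
            // so only 8 KiB is ever held in memory regardless of the file size
            while (in.read(buf) != -1) { /* keep reading */ }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}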
Upvotes: 1