user840718

Reputation: 1611

Remove duplicate rows from csv file without writing a new file

This is my code for now:

File file1 = new File("file1.csv");
File file2 = new File("file2.csv");
HashSet<String> f1 = new HashSet<>(FileUtils.readLines(file1));
HashSet<String> f2 = new HashSet<>(FileUtils.readLines(file2));
f2.removeAll(f1);

With removeAll() I remove from f2 all the rows that also appear in f1. Now I want to avoid creating a new csv file, to keep the process efficient; I just want to delete the duplicate rows from file2 itself.

Is this possible or do I have to create a new file?

Upvotes: 1

Views: 4103

Answers (2)

fge

Reputation: 121820

now I want to avoid creating a new csv file, to keep the process efficient

Well, sure, you can do that... if you don't mind possibly losing the file! If anything goes wrong while file2 is being overwritten in place, its original contents are gone.

DON'T DO THAT. Write to a temporary file first, then move it over the original.

And since you use Java 7, use java.nio.file. Here's an example:

final Path file1 = Paths.get("file1.csv");
final Path file2 = Paths.get("file2.csv");
final Path tmpfile = file2.resolveSibling("file2.csv.new");

final Set<String> file1Lines 
    = new HashSet<>(Files.readAllLines(file1, StandardCharsets.UTF_8));

try (
    final BufferedReader reader = Files.newBufferedReader(file2,
        StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    String line;
    while ((line = reader.readLine()) != null)
        if (!file1Lines.contains(line)) {
            writer.write(line);
            writer.newLine();
        }
}

try {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING,
        StandardCopyOption.ATOMIC_MOVE);
} catch (AtomicMoveNotSupportedException ignored) {
    Files.move(tmpfile, file2, StandardCopyOption.REPLACE_EXISTING);
}
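
For completeness (this is not part of the original answer, but all the classes come from the JDK itself), the Java 7 snippet above only needs these imports:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;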

If you use Java 8, you can use this try-with-resources block instead (note that the lambda passed to forEach cannot throw the checked IOException, so it has to be wrapped in an unchecked one):

try (
    final Stream<String> stream = Files.lines(file2, StandardCharsets.UTF_8);
    final BufferedWriter writer = Files.newBufferedWriter(tmpfile,
        StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);
) {
    stream.filter(line -> !file1Lines.contains(line))
        .forEach(line -> {
            try {
                writer.write(line);
                writer.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
}
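
If the filtered result comfortably fits in memory, a simpler variant is possible. This is only a sketch, not part of the original answer, and it additionally needs java.util.List and java.util.stream.Collectors; it collects the surviving lines first and writes them in a single call, which also sidesteps the checked-exception wrapping inside the lambda:

// collect the lines of file2 that do not appear in file1, then write them in one go
final List<String> kept;
try (final Stream<String> stream = Files.lines(file2, StandardCharsets.UTF_8)) {
    kept = stream.filter(line -> !file1Lines.contains(line))
        .collect(Collectors.toList());
}
Files.write(tmpfile, kept, StandardCharsets.UTF_8, StandardOpenOption.CREATE_NEW);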

Upvotes: 2

user840718

Reputation: 1611

I've solved it with this line of code:

FileUtils.writeLines(file2, f2);

It overwrites the file and can be a good solution for small to medium files, but for very large datasets I honestly don't know.
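
Put together with the code from the question, a minimal sketch of that approach looks like this (assuming Apache Commons IO on the classpath and that both files fit in memory; FileUtils.writeLines replaces the existing contents of file2):

File file1 = new File("file1.csv");
File file2 = new File("file2.csv");

// read both files into sets and drop from f2 every row that also occurs in f1
HashSet<String> f1 = new HashSet<>(FileUtils.readLines(file1));
HashSet<String> f2 = new HashSet<>(FileUtils.readLines(file2));
f2.removeAll(f1);

// overwrite file2 in place with the remaining rows
FileUtils.writeLines(file2, f2);

Note that going through a HashSet loses the original row order and also drops rows that are duplicated within file2 itself, and the in-place overwrite carries the data-loss risk fge describes above.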

Upvotes: 0
