Reputation: 1796

Java: check CSV file for duplicate lines using ArrayList

I have a CSV file with this content:

2017-10-29 00:00:00.0,"1005",-10227,0,0,0,332894,0,0,222,332894,222,332894
2017-10-29 00:00:00.0,"1010",-125529,0,0,0,420743,0,0,256,420743,256,420743
2017-10-29 00:00:00.0,"1005",-10227,0,0,0,332894,0,0,222,332894,222,332894
2017-10-29 00:00:00.0,"1013",-10625,0,0,-687,599098,0,0,379,599098,379,599098
2017-10-29 00:00:00.0,"1604",-1794.9,0,0,-3.99,4081.07,0,0,361,4081.07,361,4081.07

So lines 1 and 3 are duplicates. Now I want to read the file and print the duplicate lines to the console.

I wrote this Java code, which reads the file and adds it line by line to an ArrayList. I then create an immutable copy, loop through the ArrayList, and pass the immutable copy to binarySearch:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReadValidationFile {

public static void main(String[] args) {

    List<String> validationFile = new ArrayList<>();

    try(BufferedReader br = new BufferedReader(new FileReader("validation_small.csv"));){

        String line;
        while((line = br.readLine())!= null){
            validationFile.add(line);
        }

    } catch (FileNotFoundException e) {
        //e.printStackTrace();
        System.out.println("file not found " + e.getMessage());
    } catch (IOException e) {
        e.printStackTrace();
    }

    List<String> validationFileCopy = Collections.unmodifiableList(validationFile);

    for(String line : validationFile){
        int comp = Collections.binarySearch(validationFileCopy,line,new ComparatorLine());
        if (comp <= 0){
            System.out.println(line);
        }

    }
}
}

Comparator Class:

import java.util.Comparator;

public class ComparatorLine implements Comparator<String> {
@Override
public int compare(String s1, String s2) {
    return s1.compareToIgnoreCase(s2);
}
}

I expect this line to be printed:

2017-10-29 00:00:00.0,"1005",-10227,0,0,0,332894,0,0,222,332894,222,332894

But the output I get is this:

2017-10-29 00:00:00.0,"1010",-125529,0,0,0,420743,0,0,256,420743,256,420743

Can you please help me see what I am doing wrong? I think my comparator is okay. What is wrong with my ArrayLists?

Upvotes: 0

Answers (3)

DodgyCodeException

Reputation: 6123

The other answer(s) correctly state that you should be using Set instead of List. But for the sake of learning, let's have a look at your code and see where you went wrong.

public class ReadValidationFile {

public static void main(String[] args) {

    List<String> validationFile = new ArrayList<>();

    try(BufferedReader br = new BufferedReader(new FileReader("validation_small.csv"));){

The trailing semicolon inside the try-with-resources parentheses is unnecessary.

        String line;
        while((line = br.readLine())!= null){
            validationFile.add(line);
        }

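The whole read loop can be replaced by a single call (it needs java.nio.file.Files, java.nio.file.Paths and java.nio.charset.StandardCharsets; the charset must be a Charset, not a String):
List<String> validationFile = Files.readAllLines(Paths.get("validation_small.csv"), StandardCharsets.UTF_8);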

    } catch (FileNotFoundException e) {
        //e.printStackTrace();
        System.out.println("file not found " + e.getMessage());
    } catch (IOException e) {
        e.printStackTrace();
    }

    List<String> validationFileCopy = Collections.unmodifiableList(validationFile);

Actually, this is not a copy. It is just an unmodifiable view of the same list.
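
To make the difference between a view and a copy concrete, here is a minimal sketch. The class and method names (ViewVersusCopy, sizesAfterAdd) are made up for illustration and are not part of the question's code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ViewVersusCopy {

    // Returns {size of the view, size of the copy} after the source list grows.
    static int[] sizesAfterAdd() {
        List<String> original = new ArrayList<>();
        original.add("a");

        // Read-only wrapper around the same underlying list
        List<String> view = Collections.unmodifiableList(original);
        // Independent list holding the elements present right now
        List<String> copy = new ArrayList<>(original);

        // The change is visible through the view, but not in the copy
        original.add("b");

        return new int[] { view.size(), copy.size() };
    }

    public static void main(String[] args) {
        int[] sizes = sizesAfterAdd();
        System.out.println("view size: " + sizes[0]); // prints "view size: 2"
        System.out.println("copy size: " + sizes[1]); // prints "copy size: 1"
    }
}
```

If an actual immutable snapshot is wanted, wrapping a fresh copy, as in Collections.unmodifiableList(new ArrayList<>(original)), is one way to get it.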

    for(String line : validationFile){
        int comp = Collections.binarySearch(validationFileCopy,line,new ComparatorLine());

You might as well just search validationFile itself. However, you are calling binarySearch, which only works on sorted lists, and your list is not sorted. On an unsorted list the result is undefined (see the Collections.binarySearch documentation).

        if (comp <= 0){
            System.out.println(line);
        }

You are printing when comp <= 0, which is mostly the not-found case: binarySearch returns a negative number when the key is not found, and the index of the key (comp >= 0) when the search succeeds. But another problem is that you are searching the whole list for each element, so even on a sorted list the search would always succeed, because every line at least finds itself.
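
If you do want to stay with an ArrayList, one possible approach (a sketch, not the only way) is to sort a real copy and compare each line with its neighbour, since duplicates end up adjacent after sorting. The class name SortedDuplicateFinder, the method findDuplicates and the shortened sample lines are assumptions for illustration; String.CASE_INSENSITIVE_ORDER stands in for your ComparatorLine:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortedDuplicateFinder {

    private static final Comparator<String> ORDER = String.CASE_INSENSITIVE_ORDER;

    // Returns each duplicated line once, in sorted order.
    static List<String> findDuplicates(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines); // real copy, input stays untouched
        sorted.sort(ORDER);

        List<String> duplicates = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            String current = sorted.get(i);
            boolean sameAsPrevious = ORDER.compare(current, sorted.get(i - 1)) == 0;
            boolean alreadyReported = !duplicates.isEmpty()
                    && ORDER.compare(current, duplicates.get(duplicates.size() - 1)) == 0;
            if (sameAsPrevious && !alreadyReported) {
                duplicates.add(current);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the CSV lines in the question
        List<String> lines = Arrays.asList("1005,-10227", "1010,-125529", "1005,-10227");
        System.out.println(findDuplicates(lines)); // prints [1005,-10227]

        // binarySearch is only meaningful on a sorted list
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(ORDER);
        System.out.println(Collections.binarySearch(sorted, "1010,-125529", ORDER)); // prints 2 (found)
        System.out.println(Collections.binarySearch(sorted, "9999,0", ORDER));       // prints -4 (not found)
    }
}
```

Sorting costs O(n log n) and loses the original line order, which is one reason the Set-based approach is usually simpler.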

Save yourself all the trouble and use a Set instead. And, using Java 8 streams, the whole program can be reduced to the following:

public static void main(String[] args) throws Exception {
    Set<String> uniqueLines = new HashSet<>();
    Files.lines(Paths.get("validation_small.csv"), StandardCharsets.UTF_8)
            .filter(line -> !uniqueLines.add(line))
            .forEach(System.out::println);
}

If you really need to ignore case when comparing strings (from your given data, it looks like it doesn't make any difference since it's just numbers), then store each unique line by first uppercasing and then lowercasing it. This apparently cumbersome technique is necessary because just lowercasing is not enough if dealing with non-English language text. The equalsIgnoreCase method also does this.

public static void main(String[] args) throws Exception {
    Set<String> uniqueLines = new HashSet<>();
    Files.lines(Paths.get("validation_small.csv"), StandardCharsets.UTF_8)
            .filter(line -> !uniqueLines.add(line.toUpperCase().toLowerCase()))
            .forEach(System.out::println);
}
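
As an alternative sketch for the case-insensitive variant, a TreeSet can take the comparator directly, so the lines do not have to be transformed before being stored. Note that String.CASE_INSENSITIVE_ORDER does not take locale into account. The class name CaseInsensitiveDuplicates, the method findDuplicates and the sample data are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CaseInsensitiveDuplicates {

    // Returns every line that repeats an earlier line, ignoring case,
    // in the order the repeats occur and with their original casing.
    static List<String> findDuplicates(List<String> lines) {
        Set<String> seen = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        List<String> duplicates = new ArrayList<>();
        for (String line : lines) {
            // add() returns false when an equal (ignoring case) line is already present
            if (!seen.add(line)) {
                duplicates.add(line);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Abc,1", "xyz,2", "ABC,1");
        System.out.println(findDuplicates(lines)); // prints [ABC,1]
    }
}
```

A TreeSet insert is O(log n) rather than the expected O(1) of a HashSet, which should only matter for very large files.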

Upvotes: 3

JRG

Reputation: 4187

Create a Set while reading lines from the input CSV file. Whenever add() on the Set returns false, print the line, because it is a duplicate.

If you want a list of all duplicate lines, create a List and add to it every line for which add() on the Set returned false.

NOTE:

I have simulated your file reading with static data.
If your data contains only numbers and no letters, you do not need a case-insensitive comparison.
If your data does contain letters, you still do not need a special Comparator: insert into the Set with add(line.toLowerCase()), so that all lines are compared in lower case.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReadValidationFile {
    static List<String> validationFile = new ArrayList<>();
    static {
        validationFile.add("2017-10-29 00:00:00.0,\"1005\",-10227,0,0,0,332894,0,0,222,332894,222,332894");
        validationFile.add("2017-10-29 00:00:00.0,\"1010\",-125529,0,0,0,420743,0,0,256,420743,256,420743");
        validationFile.add("2017-10-29 00:00:00.0,\"1005\",-10227,0,0,0,332894,0,0,222,332894,222,332894");
        validationFile.add("2017-10-29 00:00:00.0,\"1013\",-10625,0,0,-687,599098,0,0,379,599098,379,599098");
        validationFile.add("2017-10-29 00:00:00.0,\"1604\",-1794.9,0,0,-3.99,4081.07,0,0,361,4081.07,361,4081.07");
    }

    public static void main(String[] args) {
        // Option 1 : unique lines only 
        Set<String> uniqueLinesOnly = new HashSet<>(validationFile);

        // Option 2 : unique lines and duplicate lines 
        Set<String> uniqueLines = new HashSet<>();
        Set<String> duplicateLines = new HashSet<>();
        for (String line : validationFile) {
            if (!uniqueLines.add(line.toLowerCase())) {
                duplicateLines.add(line.toLowerCase());
            }
        }

        // Option 3 : unique lines and duplicate lines by Java Streams
        Set<String> uniquesJava8 = new HashSet<>();
        List<String> duplicatesJava8 = validationFile
                                    .stream()
                                    .filter(element -> !uniquesJava8.add(element.toLowerCase()))
                                    .map(element -> element.toLowerCase())
                                    .collect(Collectors.toList());
    }
}
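
If it is also useful to know where each duplicate occurs, the Set could be swapped for a Map from line to line numbers. This is only a sketch along the same lines as the options above; the class name DuplicateLineNumbers, the method findDuplicatePositions and the shortened sample data are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateLineNumbers {

    // Maps each duplicated line to the 1-based line numbers where it occurs.
    static Map<String, List<Integer>> findDuplicatePositions(List<String> lines) {
        // LinkedHashMap keeps lines in order of first appearance
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            positions.computeIfAbsent(lines.get(i), key -> new ArrayList<>()).add(i + 1);
        }
        // Drop lines that occur only once
        positions.values().removeIf(numbers -> numbers.size() < 2);
        return positions;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1005,-10227", "1010,-125529", "1005,-10227");
        System.out.println(findDuplicatePositions(lines)); // prints {1005,-10227=[1, 3]}
    }
}
```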

Upvotes: 3

Eritrean

Reputation: 16498

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReadValidationFile {
    public static void main(String[] args){       
        List<String> validationFile = new ArrayList<>();
        try(BufferedReader br = new BufferedReader(new FileReader("validation_small.csv"));){
            String line;
            while((line = br.readLine())!= null){
                validationFile.add(line);
            }
        } catch (FileNotFoundException e) {
            //e.printStackTrace();
            System.out.println("file not found " + e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
        Set<String> uniques = new HashSet<>();        
        List<String> duplicates = validationFile.stream().filter(i->!uniques.add(i)).collect(Collectors.toList());
        System.out.println(duplicates);
    }
}

Upvotes: 1
