Melad Basilius
Melad Basilius

Reputation: 4306

How to determine the delimiter in CSV file

I have a scenario at which i have to parse CSV files from different sources, the parsing code is very simple and straightforward.

        String csvFile = "/Users/csv/country.csv";
        String line = "";
        String cvsSplitBy = ",";
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
            while ((line = br.readLine()) != null) {
                // use comma as separator
                String[] country = line.split(cvsSplitBy);
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

my problem come from the CSV delimiter character, i have many different formats, some time it is a , sometimes it is a ;

is there is any way to determine the delimiter character before parsing the file

Upvotes: 18

Views: 30900

Answers (6)

Linh K Ha
Linh K Ha

Reputation: 41

While I agree with Lefteris008 that it is not possible to have the function that correctly determine all the cases, we can have a function that is both efficient and give mostly correct result in practice.

def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters= [',',';','\t',' ','|',':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
                return d
    return ','

My efficient implementation is based on

  1. Prior knowledge such as list of common delimiter you often work with ',;\t |:' , or even the likely hood of the delimiter to be used so that I often put the regular ',' on the top of the list
  2. The frequency of the delimiter appear in each line of the text file are equal. This is to resolve the problem that if we read a single line and see the frequency to be equal (false detection as Lefteris008) or even the right delimiter to appear less frequent as the wrong one in the first line
  3. The efficient implementation of a head function that read only first n lines from the file
  1. As you increase the number of test sample n, the likely hood that you get a false answer reduce drastically. I often found n=2 to be adequate

Upvotes: 1

Jeronimo Backes
Jeronimo Backes

Reputation: 6289

univocity-parsers supports automatic detection of the delimiter (also line endings and quotes). Just use it instead of fighting with your code:

CsvParserSettings settings = new CsvParserSettings();
settings.detectFormatAutomatically();

CsvParser parser = new CsvParser(settings);
List<String[]> rows = parser.parseAll(new File("/path/to/your.csv"));

// if you want to see what it detected
CsvFormat format = parser.getDetectedFormat();

Disclaimer: I'm the author of this library and I made sure all sorts of corner cases are covered. It's open source and free (Apache 2.0 license)

Hope this helps.

Upvotes: 20

Lefteris008
Lefteris008

Reputation: 906

Yes, but only if the delimiter characters are not allowed to exist as regular text

The most simple answer is to have a list with all the available delimiter characters and try to identify which character is being used. Even though, you have to place some limitations on the files or the person/people that created them. Look a the following two scenarios:

Case 1 - Contents of file.csv

test,test2,test3

Case 2 - Contents of file.csv

test1|test2,3|test4

If you have prior knowledge of the delimiter characters, then you would split the first string using , and the second one using |, getting the same result. But, if you try to identify the delimiter by parsing the file, both strings can be split using the , character, and you would end up with this:

Case 1 - Result of split using ,

test1
test2
test3

Case 2 - Result of split using ,

test1|test2
3|test4

By lacking the prior knowledge of which delimiter character is being used, you cannot create a "magical" algorithm that will parse every combination of text; even regular expressions or counting the number of appearance of a character will not save you.

Worst case

test1,2|test3,4|test5

By looking the text, one can tokenize it by using | as the delimiter. But the frequency of appearance of both , and | are the same. So, from an algorithm's perspective, both results are accurate:

Correct result

test1,2
test3,4
test5

Wrong result

test1
2|test3
4|test5

If you pose a set of guidelines or you can somehow control the generation of the CSV files, then you could just try to find the delimiter used with String.contains() method, employing the aforementioned list of characters. For example:

public class MyClass {

    private List<String> delimiterList = new ArrayList<>(){{
        add(",");
        add(";");
        add("\t");
        // etc...
    }};

    private static String determineDelimiter(String text) {
        for (String delimiter : delimiterList) {
            if(text.contains(delimiter)) {
                return delimiter;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String csvFile = "/Users/csv/country.csv";
        String line = "";
        String cvsSplitBy = ",";
        String delimiter = "";
        boolean firstLine = true;
        try (BufferedReader br = new BufferedReader(new FileReader(csvFile)))  {
            while ((line = br.readLine()) != null) {
                if(firstLine) {
                    delimiter = determineDelimiter(line);
                    if(delimiter.equalsIgnoreCase("")) {
                        System.out.println("Unsupported delimiter found: " + delimiter);
                        return;
                    }
                    firstLine = false;
                }
                // use comma as separator
                String[] country = line.split(delimiter);
                System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Update

For a more optimized way, in determineDelimiter() method instead of the for-each loop, you can employ regular expressions.

Upvotes: 4

DodgyCodeException
DodgyCodeException

Reputation: 6133

If the delimiter can appear in a data column, then you are asking for the impossible. For example, consider this first line of a CSV file:

one,two:three

This could be either a comma-separated or a colon-separated file. You can't tell which type it is.

If you can guarantee that the first line has all its columns surrounded by quotes, for example if it's always this format:

"one","two","three"

then you may be able to use this logic (although it's not 100% bullet-proof):

if (line.contains("\",\""))
    delimiter = ',';
else if (line.contains("\";\""))
    delimiter = ';';

If you can't guarantee a restricted format like that, then it would be better to pass the delimiter character as a parameter.

Then you can read the file using a widely-known open-source CSV parser such as Apache Commons CSV.

Upvotes: 1

Kai Adelmann
Kai Adelmann

Reputation: 199

That depends....

If your datasets are always the same length and/or the separator NEVER occurs in your datacolumns, you could just read the first line of the file, look at it for the longed for separator, set it and then read the rest of the file using that separator.

Something like

String csvFile = "/Users/csv/country.csv";
String line = "";
String cvsSplitBy = ",";
try (BufferedReader br = new BufferedReader(new FileReader(csvFile))) {
    while ((line = br.readLine()) != null) {
        // use comma as separator
        if (line.contains(",")) {
            cvsSplitBy = ",";
        } else if (line.contains(";")) {
           cvsSplitBy = ";";
        } else {
            System.out.println("Wrong separator!");
        }
        String[] country = line.split(cvsSplitBy);
        System.out.println("Country [code= " + country[4] + " , name=" + country[5] + "]");
    }
} catch (IOException e) {
    e.printStackTrace();
}

Greetz Kai

Upvotes: -1

Balayesu Chilakalapudi
Balayesu Chilakalapudi

Reputation: 1406

Add a condition like this,

String [] country;
if(line.contains(",")
    country = line.split(",");
else if(line.contains(";"))
    country=line.split(";");

Upvotes: -1

Related Questions