rom
rom

Reputation: 3672

Can I import a CSV file and automatically infer the delimiter?

I want to import two kinds of CSV files, some use ";" for delimiter and others use ",". So far I have been switching between the next two lines:

reader=csv.reader(f,delimiter=';')

or

reader=csv.reader(f,delimiter=',')

Is it possible not to specify the delimiter and to let the program check for the right delimiter?

The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with Libroffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file:

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

Upvotes: 56

Views: 72590

Answers (6)

valq
valq

Reputation: 11

If csv.Sniffer is not fit your needs, following up on @twalberg's idea, here's two, possible implementation of identify the right delimiter, but not just checking for common ,,; and | delimiters, but universally identifying any kind of weird delimiters in csv-like files.

The naive approach

Below code reads the csv file's first 10 rows, grabs any non-alpha-numeric characters, and count the frequencies.

This code solely based on the law of greater numbers, that the most popular non-alphanumeric char should be generally the delimiter.

import re
from collections import Counter
def delimiter_detector(file_path):
    sample_data = ""
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        i = 0
        while i < 10:
            try:
                sample_data += file.readline()
                i += 1
            except StopIteration:
                break

    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    delimiters_frequency = Counter(non_alnum_chars)
    if len(delimiters_frequency) == 0:
        return None

    # Find and return the most common delimiter
    most_common_delimiter = delimiters_frequency.most_common(1)[0][0]
    return most_common_delimiter

print(delimiter_detector('test.csv'))

Of course this would break, if let's say we have 5 columns delimited by | (4 times / row), but also 5+ . char in each row in the following 9 rows. {'|': 10*4, '.': 9*5}

Less naive approach

Hence, a better approach would be to first check and count all the special chars in the header/first row,the do the same in the succeeding rows.

After identify the special chars in the first row, then check, that out of those characters, which one's frequency is matching most of the time in the rest of the rows.

Continuing above example, even in the worst case, if we have 4 | and 4 . in the header row, which would imply that | and . can be the delimiter, by checking the next n rows, generally '|':4 will be the most frequent, while . and other special chars would vary.

import re
from collections import Counter

def frequency_counter(sample_data):
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    return dict(Counter(non_alnum_chars))
def delimiter_detector(file_path):
    possible_delimiters = []

    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        i = 0
        while i < 10:
            try:
                freqeunt_nonalpha = frequency_counter(file.readline().strip())
                possible_delimiters.append(freqeunt_nonalpha)
                i += 1
            except StopIteration:
                break


    if len(possible_delimiters) == 0:
        return None

    # Find the most common delimiter in the header row
    potential_delimiters = []
    header_row = possible_delimiters[0]
    # adding potential delimiter to the list if it's in the header row and the frequencies are equal
    for data_row in possible_delimiters[1:]:
        for data_row_delim in data_row:
            if data_row_delim in header_row:
                # if the header and data row frequencies are equal, it's a potential delimiter
                if header_row[data_row_delim] == data_row[data_row_delim]:
                    potential_delimiters.append(data_row_delim)

    # identify the most common potential delimiter
    most_common_delimiter = Counter(potential_delimiters).most_common()
    print(most_common_delimiter)
    return most_common_delimiter[0][0][0]

print(delimiter_detector('test.csv'))

this approach would work, where the first, naive approach would fail.

c1|c2|c3|c4|c5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

The only case, where this would fail, if the other special char is present in the header row, and occurs in the other rows as well, exactly the same amount of times as the actual delimiter char:

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

In this case . is just as valid delimiter as |. However, if there is even one row, where the count is not exactly the same as in the header row the latter approach would successfully identify the right delimiter. For that reason, it might worth to increase the number of rows to check.

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
a.|b.|c.|d..|e.

Some additional things to consider

The mentioned approaches assuming that the CSV file has a header row. If the file does not have a header, the method would treat the first data row as the header, leading to potential mistakes in delimiter detection.

Another thing that frequently causes issues for me, is the file encoding. Especially, if you're working on Windows, Excel likes to save it with Windows-1250 encoding. So make sure, that you are defined the right encoding when reading the file. To detect encoding, you can use the chardet library.

Upvotes: 1

Bill Lynch
Bill Lynch

Reputation: 81966

The csv module seems to recommend using the csv sniffer for this problem.

They give the following example, which I've adapted for your case.

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

And our sample inputs

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

And if we execute the example program:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It's also probably worth noting what version of python I'm using.

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2

Upvotes: 65

Vladir Parrado Cruz
Vladir Parrado Cruz

Reputation: 2359

And if you're using DictReader you can do that:

#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

I used this with Python 3.5 and it worked this way.

Upvotes: 7

Andrew Basile
Andrew Basile

Reputation: 896

Given a project that deals with both , (comma) and | (vertical bar) delimited CSV files, which are well formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

However, on a |-delimited file, the "Could not determine delimiter" exception was returned. It seemed reasonable to speculate that the sniff heuristic might work best if each line has the same number of delimiters (not counting whatever might be enclosed in quotes). So, instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety:

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this is working well for me.

Upvotes: 14

rom
rom

Reputation: 3672

To solve the problem, I have created a function which reads the first line of a file (header) and detects the delimiter.

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"

Upvotes: 10

twalberg
twalberg

Reputation: 62409

I don't think there can be a perfectly general solution to this (one of the reasons I might use , as a delimiter is that some of my data fields need to be able to include ;...). A simple heuristic for deciding might be to simply read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the right delimiter.

Upvotes: 2

Related Questions