Reputation: 18094

How to skip pre header lines with csv.DictReader?

I want to csv.DictReader to deduce the field names from the file. The docs say "If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.", but in my case the first row containts a title and the 2nd row which contains the names.

I can't apply next(reader) as per Python 3.2 skip a line in csv.DictReader because the fieldname assignment takes place when initializing the reader (or I'm doing it wrong).

The csvfile (exported from Excel 2010, original source):

CanVec v1.1.0,,,,,,,,,^M
Entity,Attributes combination,"Specification Code
Point","Specification Code
Line","Specification Code
Area",Generic Code,Theme,"GML - Entity name
Shape - File name
Point","GML - Entity name
Shape - File name
Line","GML - Entity name
Shape - File name
Area"^M
Amusement park,Amusement park,,,2260012,2260009,LX,,,LX_2260009_2^M
Auto wrecker,Auto wrecker,,,2360012,2360009,IC,,,IC_2360009_2^M

My code:

f = open(entities_table,'rb')
try:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)

    reader = csv.DictReader(f, dialect=dialect)
    print 'I think the field names are:\n%s\n' % (reader.fieldnames)

    i = 0
    for row in reader:
        if i < 20:
            print row
            i = i + 1

finally:
    f.close()

Current results:

I think the field names are:
['CanVec v1.1.0', '', '', '', '', '', '', '', '', '']

Desired result:

I think the field names are:
['Entity','Attributes combination','"Specification Code Point"',...snip]

I realize it would be expedient to simply delete the first row and carry on, but I'm trying to get as close to just reading the data in situ as I can and minimize manual intervention.

Upvotes: 15

Answers (3)

NeilG

Reputation: 4160

General solution

This is a more general solution for when it's not certain at which row the headers will arrive.

This solution also cleans up the headers - expecting the first "word" (space separated) as the actual column header and trailing characters to be treated as comments (and dropped).

Not possible wholly within csv.DictReader

It's not possible to handle skip lines inside csv.DictReader. Firstly it doesn't have a "number of skip lines" feature (like Pandas) but also, we don't know how many lines it's going to be. So it's determined that external processing is required. The file contents need to be read and processed.

Clean solution

Suggestions for improvement are welcome, but this solution avoids importing unnecessary libraries and does not use broken hacks like closing a file within a context manager and re-opening it (!).

Python 3

import csv

FILENAME = "data.csv"


with open(FILENAME, newline="", encoding="utf-8") as csvfile:
    for line in csvfile:
        if "known_field_name" in line:
            break
    else:
        sys.exit("Error: Headers not found")
    fieldnames = [(field.strip().split() or [""])[0] for field in line.split(",")]
    rows = [row for row in csv.DictReader(csvfile, fieldnames=fieldnames)]

print("Field names are: {}".format(fieldnames)) 

for row in rows:
    print(row)

Note this code will not produce the results expected in the question because the column headers include spaces in the test data, and this solution has a specific feature to clean column headers up.

Just drop that feature if spaces are allowed in column headers.

Slurping whole file

This solution slurps the whole file in order to discard the file handle early, as it's intended for a smaller data set. If a large data set is in play then keep processing within the csvfile context manager and don't slurp the whole file into an initial list.

Field names processing

The (field.strip().split() or [""])[0] step is used to get a clean column header. Because field.strip() can produce an empty string, then field.strip().split() can also produce an empty list, and [][0] would then raise an exception.

This is where ... or [""] ensures there is a list of length of 1 or greater to avoid raising the IndexError.

Indexing the first item is necessary to separate the first word from other commentary in the column header cell (no newlines or commas allowed, see below).

Vulnerabilities

The field names processing has necessarily been done manually, outside of csv. In particular, the simple approach of split on comma is expected to provide all column headers from a single line.

Do not allow commas inside column headers
Do not allow line breaks inside column headers

If either or both of these two rules are broken this code will not "realise" and processing will break.

Upvotes: 0

Pavel Trufanov

Reputation: 26

I used islice from itertools. My header was in the last line of a big preamble. I have passed preamble and used hederline for fieldnames:

with open(file, "r") as f:
    '''Pass preamble'''
    n = 0
    for line in f.readlines():
        n += 1
        if 'same_field_name' in line: # line with field names was found
            h = line.split(',')
            break
    f.close()
    f = islice(open(i, "r"), n, None)

    reader = csv.DictReader(f, fieldnames = h)

Upvotes: 1

Wooble

Reputation: 90007

After f.seek(0), insert:

next(f)

to advance the file pointer to the second line before initializing the DictReader.

Upvotes: 19