Reputation: 3244
I have many dictionaries whose keys are not consistent, and I have to write them to CSV.
While using DictWriter, it requires the headers to be defined at declaration, which I don't know up front. I also can't collect the dicts in a list until I have seen them all, as their count runs into millions.
I'm looking for a library that lets me keep writing CSV rows with inconsistent keys and that manages the columns under the hood.
Data:
{'a':'1','i':'1','l':None}
{'b':'1','k':'1','y':None}
{'g':'1','k':'1','j':None}
{'b':'1','h':'1','c':None}
{'b':'1','h':'1','n':None}
{'a':'1','b':'1','v':None}
{'a':'1','b':'1','c':None}
Expected Output: (CSV like this)
a,c,b,g,i,h,k,j,l,n,v,y
1,,,,1,,,,,,,
,,1,,,,1,,,,,
,,,1,,,1,,,,,
,,1,,,1,,,,,,
,,1,,,1,,,,,,
1,,1,,,,,,,,,
1,,1,,,,,,,,,
This data comes from an iterator; I can't store it in memory as it's very large.
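For reference, a minimal sketch of the DictWriter limitation described above (the field names and values are just illustrative):
import csv
import io

rows = [{'a': '1', 'i': '1', 'l': None},
        {'b': '1', 'k': '1', 'y': None}]

f = io.StringIO()
# the full field list must be fixed when the writer is constructed
writer = csv.DictWriter(f, fieldnames=['a', 'i', 'l'])
writer.writeheader()
writer.writerow(rows[0])  # fine: every key is in fieldnames
writer.writerow(rows[1])  # ValueError: dict contains fields not in fieldnames: 'b', 'k', 'y'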
Upvotes: 3
Views: 2745
Reputation: 78546
If you don't know all the header fields in advance and you think it's inefficient to iterate the iterable of dicts once just to collect them, then you should not be using DictWriter.
You can instead keep track of all the fields seen so far, appending each new field to the end of the header as it first appears, so that newer fields end up at the tail of the rows in the file. Fields already seen but missing from a given row are written as blanks.
Here's some toy code in Python 3 that demonstrates the concept:
import io
import csv

# supposedly lengthy iterator containing dicts
it = iter([{'name': 'Bob', 'house': 5, 'cell': 8090},
           {'name': 'Lisa', 'class': 12, 'age': 53},
           {'done': False, 'flat': 6}])

# simulate file
s = io.StringIO()
writer = csv.writer(s)

header, header_set = [], set()
writer.writerow('')  # placeholder for header

for row in it:
    for key in row:
        if key not in header_set:  # first time this field is seen
            header_set.add(key)
            header.append(key)
    writer.writerow(row.get(col, '') for col in header)

# TEST: recover written file as iterable of dicts using DictReader
s.seek(0)
reader = csv.DictReader(s, fieldnames=header, restval='')
for row in reader:
    print(row)
OrderedDict([('name', 'Bob'), ('house', '5'), ('cell', '8090'), ('class', ''), ('age', ''), ('done', ''), ('flat', '')])
OrderedDict([('name', 'Lisa'), ('house', ''), ('cell', ''), ('class', '12'), ('age', '53'), ('done', ''), ('flat', '')])
OrderedDict([('name', ''), ('house', ''), ('cell', ''), ('class', ''), ('age', ''), ('done', 'False'), ('flat', '6')])
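For a real file on disk rather than the simulated StringIO, the same pattern looks roughly like this (write_sparse_csv, records and path are just illustrative names, not part of the csv module):
import csv

def write_sparse_csv(records, path):
    # Stream dicts with inconsistent keys to CSV, growing the header as new keys appear.
    header, header_set = [], set()
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow('')  # blank placeholder line where the header will go
        for row in records:
            for key in row:
                if key not in header_set:
                    header_set.add(key)
                    header.append(key)
            writer.writerow(row.get(col, '') for col in header)
    return header  # needed afterwards to fill in the header line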
If you need to update the existing file with the header afterwards, you can write the header to a new file and then copy all the rows from the previous file onto the following lines (the blank placeholder line written earlier supplies the line break after the header):
s.seek(0)
new_file.write(','.join(header) + s.read())
However, considering your initial point that the number of rows is quite large, you can avoid the memory-intensive s.read() by manually copying the header into the existing file using a file editor :)
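If you'd rather do that fix-up programmatically without loading the data into memory, here is a rough sketch using a streaming copy (out.csv and final.csv are placeholder names, and header is the list collected during the writing pass):
import shutil

with open('out.csv', 'r', newline='') as src, \
        open('final.csv', 'w', newline='') as dst:
    dst.write(','.join(header) + '\n')  # real header replaces the placeholder
    next(src)                           # skip the blank placeholder line
    shutil.copyfileobj(src, dst)        # copy the data rows through in chunks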
Upvotes: 2