Zoro99
Zoro99

Reputation: 185

Python- Arranging different rows of CSV file by constant column header

I have a CSV file which auto updates some data in following order

A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5

As you would notice, the header values appear in different order in different rows. At times one of the header column values is missing as well.

Requirement is to sort it with consistent header and all values below that. For e.g.

A,B,C,D,E,F
4,2,6,4,8,9
A,B,C,D,E,F
5,8,4,6,6,2
A,B,C,D,E,F
7, ,6,5,4,2

OR

    A,B,C,D,E,F
    4,2,6,4,8,9
    5,8,4,6,6,2
    7, ,6,5,4,2

I tried sorting it with following code, however it only sorts the first row and later on it prints as it is.

with open('mycsv.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
    fieldnames = ['A','B','C','D','E','F','G']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in csv.DictReader(infile):
        writer.writerow(row)

Any pointers on how to achieve this would help. Thanks.

Upvotes: 3

Views: 112

Answers (2)

gboffi
gboffi

Reputation: 25023

I propose a solution where the headers can come into play in any order.

The values are stored into a dictionary, we want to have a count of the (couple of) lines seen so far

d = {}
count = 0

We have a loop on the couples of lines (it's a variation on the grouper from itertools' recipes, next we split the two lines on commas and create a Set containing the current headers

for l1, l2 in zip(*[open(datafile)]*2):
    heads = l1.rstrip().split(',')
    vals  = l2.rstrip().split(',') 
    headset = set(heads) 

We have a loop to increment the lists associated with each dictionary, using d.setdefault to have the correct initial value (note that [None]*0 is the void list []), if a new header comes on stage when. e.g., count is 3 we append to a list containing 3 None, [None, None, None].

    for h, v in zip(heads, vals): 
        d.setdefault(h,([None]*count)).append(v) 

After taking into account the headers that are present in this couple of lines, we take into account the headers that were seen previously, but are not in this c.o.l.

    for h in set(d)-headset:
        d[h].append(None) 

Eventually we increment the counter

    count = count+1           

Now we are ready for the output, we sort the keys, we print the keys and next, once for each couple of lines, we print from the lists associated with the keys

keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))

All together

d = {}
count = 0
for l1, l2 in zip(*[open(datafile)]*2):
    heads = l1.rstrip().split(',')
    vals  = l2.rstrip().split(',') 
    headset = set(heads) 
    for h, v in zip(heads, vals): 
        d.setdefault(h,([None]*count)).append(v) 
    for h in set(d)-headset:
        d[h].append(None) 
    count = count+1           
keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))

Testing It

$ cat dat.csv 
A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5
$ cat head.py 
d = {}
count = 0

for l1, l2 in zip(*[open('dat.csv')]*2):
    heads = l1.rstrip().split(',')
    vals  = l2.rstrip().split(',') 
    headset = set(heads) 
    for h, v in zip(heads, vals): 
        d.setdefault(h,([None]*count)).append(v) 
    for h in set(d)-headset:
        d[h].append(None) 
    count = count+1

keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))
$ python head.py 
A,B,C,D,E,F
4,2,6,4,8,9
5,8,4,6,6,2
7, ,6,5,4,2
$ 

Upvotes: 1

Patrick Artner
Patrick Artner

Reputation: 51643

You can import your file, and continue to read 2 lines (header + data) and create a dict for them. You add the dict to a list containing all of your data. You get the largest dict (the one that contains the most keys), sort it and write all the data back.

In dicts that miss a key, you can subtitute its value by an empty string:

Create data file:

with open("t.csv","w") as f:
    f.write("""A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5""")

Then:

# read in data as list of dicts, each dict contains 2 rows worth of data    
data = []
with open("t.csv") as f:
    while True:
        try:
            # get a header line and a data line
            header = next(f).strip().split(",")
            d = next(f).strip().split(",")
            # create a dict from it and append it to your data collection
            data.append( {k:v for k,v in zip(header,d)} )

        except StopIteration:
            print("done")
            break

# get a sorted set of all keys in all dicts:
keys = set()
for k in data:
    keys.update(k)
keys = sorted(keys)

# write the data again
with open("new_t.csv","w") as f:
    # write headers once
    f.write(",".join(keys))
    f.write("\n")
    for d in data:
        f.write(",".join( ( d.get(k,"") for k in keys  )))
        f.write("\n")

# check:
with open("new_t.csv","r") as f:
    print(f.read())

Resulting file:

A,B,C,D,E,F
4,2,6,4,8,9
5,8,4,6,6,2
7,,6,5,4,2

I use python3 style printing - but the code works the same in python 2.7 and 3.x.

Make sure to check that your source file contains header+data rows and no empty ones, else you have to adjust the code to omit empty lines.

Upvotes: 1

Related Questions