Reputation: 185
I have a CSV file which auto-updates some data in the following order:
A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5
As you will notice, the header values appear in a different order in different rows. At times one of the header columns is missing as well.
The requirement is to reorder it with a consistent header and all values below that, e.g.
A,B,C,D,E,F
4,2,6,4,8,9
A,B,C,D,E,F
5,8,4,6,6,2
A,B,C,D,E,F
7, ,6,5,4,2
OR
A,B,C,D,E,F
4,2,6,4,8,9
5,8,4,6,6,2
7, ,6,5,4,2
I tried sorting it with the following code; however, it only sorts the first row and then prints the rest as-is.
import csv

with open('mycsv.csv', 'r') as infile, open('reordered.csv', 'a') as outfile:
    fieldnames = ['A','B','C','D','E','F','G']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in csv.DictReader(infile):
        writer.writerow(row)
Any pointers on how to achieve this would help. Thanks.
Upvotes: 3
Views: 112
Reputation: 25023
I propose a solution where the headers can come into play in any order.
The values are stored in a dictionary, and we keep a count of the pairs of lines seen so far:
d = {}
count = 0
We loop over the pairs of lines (a variation on the grouper from the itertools recipes); next we split the two lines on commas and create a set containing the current headers:
for l1, l2 in zip(*[open(datafile)]*2):
    heads = l1.rstrip().split(',')
    vals = l2.rstrip().split(',')
    headset = set(heads)
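As an aside (my illustration, not part of the original answer), here is a minimal sketch of why zip(*[iterable]*2) yields consecutive pairs: both positions of the zip share the same iterator, so each tuple consumes two items.
lines = ['A,B\n', '1,2\n', 'B,A\n', '3,4\n']
it = iter(lines)
# both zip arguments are the same iterator object, so each tuple consumes two lines
for l1, l2 in zip(*[it]*2):
    print(l1.strip(), '|', l2.strip())
# A,B | 1,2
# B,A | 3,4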
We then have a loop that extends the list associated with each header, using d.setdefault to supply the correct initial value (note that [None]*0 is the empty list []) if a new header comes on stage: when, e.g., count is 3, we append to a list containing three Nones, [None, None, None].
    for h, v in zip(heads, vals):
        d.setdefault(h, [None]*count).append(v)
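For illustration only (assumed values, not taken from the data above), this is how setdefault pads a header that first appears after two pairs have already been processed:
d = {}
count = 2                      # two pairs of lines already processed
d.setdefault('G', [None]*count).append('7')
print(d)                       # {'G': [None, None, '7']}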
After taking into account the headers that are present in this pair of lines, we take care of the headers that were seen previously but are not in this pair:
    for h in set(d)-headset:
        d[h].append(None)
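A small, assumed example of that set difference: headers already in the dictionary but absent from the current pair each get a None placeholder.
d = {'A': ['4'], 'B': ['2'], 'F': ['9']}
headset = {'A', 'F'}           # headers present in the current pair
print(set(d) - headset)        # {'B'} -> 'B' gets None appended for this pair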
Finally, we increment the counter:
    count = count+1
Now we are ready for the output: we sort the keys, print them, and then, once for each pair of lines, print the values from the lists associated with the keys:
keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))
All together
d = {}
count = 0
for l1, l2 in zip(*[open(datafile)]*2):
    heads = l1.rstrip().split(',')
    vals = l2.rstrip().split(',')
    headset = set(heads)
    for h, v in zip(heads, vals):
        d.setdefault(h, [None]*count).append(v)
    for h in set(d)-headset:
        d[h].append(None)
    count = count+1
keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))
Testing It
$ cat dat.csv
A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5
$ cat head.py
d = {}
count = 0
for l1, l2 in zip(*[open('dat.csv')]*2):
    heads = l1.rstrip().split(',')
    vals = l2.rstrip().split(',')
    headset = set(heads)
    for h, v in zip(heads, vals):
        d.setdefault(h, [None]*count).append(v)
    for h in set(d)-headset:
        d[h].append(None)
    count = count+1
keys = sorted(d.keys())
print(','.join(keys))
for n in range(count):
    print(','.join(' ' if d[k][n] is None else str(d[k][n]) for k in keys))
$ python head.py
A,B,C,D,E,F
4,2,6,4,8,9
5,8,4,6,6,2
7, ,6,5,4,2
$
Upvotes: 1
Reputation: 51643
You can read your file two lines at a time (header + data) and create a dict for each pair. You add each dict to a list containing all of your data. You then collect the set of all keys that occur in any of the dicts, sort it, and write all the data back.
In dicts that miss a key, you can substitute its value with an empty string.
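As a quick, assumed illustration of that substitution with dict.get:
row = {'A': '7', 'C': '6'}     # 'B' is missing for this pair of lines
print(row.get('B', ''))        # '' -> empty cell in the output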
Create data file:
with open("t.csv","w") as f:
f.write("""A,B,C,D,E,F
4,2,6,4,8,9
D,C,A,B,E,F
6,4,5,8,6,2
E,F,A,C,D
4,2,7,6,5""")
Then:
# read in data as list of dicts, each dict contains 2 rows worth of data
data = []
with open("t.csv") as f:
    while True:
        try:
            # get a header line and a data line
            header = next(f).strip().split(",")
            d = next(f).strip().split(",")
            # create a dict from it and append it to your data collection
            data.append( {k:v for k,v in zip(header,d)} )
        except StopIteration:
            print("done")
            break

# get a sorted set of all keys in all dicts:
keys = set()
for k in data:
    keys.update(k)
keys = sorted(keys)

# write the data again
with open("new_t.csv","w") as f:
    # write headers once
    f.write(",".join(keys))
    f.write("\n")
    for d in data:
        f.write(",".join( ( d.get(k,"") for k in keys )))
        f.write("\n")

# check:
with open("new_t.csv","r") as f:
    print(f.read())
Resulting file:
A,B,C,D,E,F
4,2,6,4,8,9
5,8,4,6,6,2
7,,6,5,4,2
I use Python 3-style printing, but the code works the same in Python 2.7 and 3.x.
Make sure your source file contains only header+data rows and no empty ones; otherwise you have to adjust the code to skip empty lines, as sketched below.
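A minimal sketch of one way to do that (my addition, reusing the same pairing idea as above): filter out blank lines before pairing header and data lines.
# drop blank lines first, then pair header/data lines as before
with open("t.csv") as f:
    lines = [line.strip() for line in f if line.strip()]

data = []
for header_line, data_line in zip(lines[::2], lines[1::2]):
    header = header_line.split(",")
    d = data_line.split(",")
    data.append({k: v for k, v in zip(header, d)})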
Upvotes: 1