Reputation:
I'm totally new to python.
I have a csv file. Its structure is:
Ip, Tag, Sentence_id, scheme
#1, yes, 1, 1
#1, yes, 2, 2
#2, no, 1, 1
#3 maybe, 3, 3
There are 100 sentence_id, a number of ips and 6 schemes (1-6).
Is there a way with python to restructure the csv so that it's output is a matrix of the following structure:
Ip, scheme 1, scheme 2, scheme 3, sentence_id
#1 yes, NA, NA, 1
#1 NA, yes, NA, 2
#2 no, NA, NA, 1
#3 NA, NA, maybe, 3
I'm not sure if python is the "right" language to use for something like this. I've been guided to either python or awk, but have no idea of either. Thanks.
Upvotes: 0
Views: 397
Reputation: 1933
This should read through the file line by line, reorder columns, etc.
Just replace in_filename, out_filename, NUM_SCHEMES, and the header with whatever you want.
from csv import reader, writer
with open("in_filename") as in_file, open("out_filename") as out_file:
in_csv = csv.reader(in_file)
out_csv = csv.writer(out_file)
next(in_csv) # Skip header
out_csv.writerow(["Ip", "scheme 1", "scheme 2", "scheme 3", "sentence_id"])
for row in in_csv:
ip, tag, sentence_id, scheme = row
out_row = [ip]
for i in range(NUM_SCHEMES):
out_row.append(tag if i == scheme else "NA")
out_row.append(sentence_id)
out_csv.writerow(out_row)
Upvotes: 1
Reputation: 142126
You can use csv.DictReader
to load the file into a list of dictionaries, then from that, find the maximum scheme
value to build your output field names. For each row in the list, set the scheme N
field to be equal to the value of the Tag
column. Then we use csv.DictWriter
to fill any missing values with NA
where a key is not present, eg:
import csv
with open('input.csv', 'rb') as fin, open('output.csv', 'wb') as fout:
rows = list(csv.DictReader(fin, skipinitialspace=True))
schemes = range(1, max(int(row['scheme']) for row in rows) + 1)
fieldnames = ['Ip'] + ['scheme {}'.format(i) for i in schemes] + ['Sentence_id']
csvout = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore', restval='NA')
csvout.writeheader()
for row in rows:
row['scheme {}'.format(row['scheme'].strip())] = row['Tag']
csvout.writerow(row)
This gives the following output given your example input:
Ip,scheme 1,scheme 2,scheme 3,Sentence_id
#1,yes,NA,NA,1
#1,NA,yes,NA,2
#2,no,NA,NA,1
#3,NA,NA,maybe,3
Upvotes: 1