user3443113
user3443113

Reputation:

Python - csv column restructuring

I'm totally new to python.

I have a csv file. Its structure is:

Ip, Tag, Sentence_id, scheme
#1, yes,      1,        1
#1, yes,      2,        2 
#2, no,       1,        1
#3  maybe,    3,        3

There are 100 sentence_id, a number of ips and 6 schemes (1-6).

Is there a way with python to restructure the csv so that it's output is a matrix of the following structure:

Ip,   scheme 1, scheme 2, scheme 3, sentence_id
#1      yes,      NA,          NA,      1
#1      NA,       yes,         NA,      2
#2      no,       NA,          NA,      1
#3      NA,       NA,         maybe,    3

I'm not sure if python is the "right" language to use for something like this. I've been guided to either python or awk, but have no idea of either. Thanks.

Upvotes: 0

Views: 397

Answers (2)

horns
horns

Reputation: 1933

This should read through the file line by line, reorder columns, etc.

Just replace in_filename, out_filename, NUM_SCHEMES, and the header with whatever you want.

from csv import reader, writer

with open("in_filename") as in_file, open("out_filename") as out_file:
    in_csv = csv.reader(in_file)
    out_csv = csv.writer(out_file)

    next(in_csv) # Skip header
    out_csv.writerow(["Ip", "scheme 1", "scheme 2", "scheme 3", "sentence_id"])

    for row in in_csv:
        ip, tag, sentence_id, scheme = row

        out_row = [ip]

        for i in range(NUM_SCHEMES):
            out_row.append(tag if i == scheme else "NA")

        out_row.append(sentence_id)

        out_csv.writerow(out_row)

Upvotes: 1

Jon Clements
Jon Clements

Reputation: 142126

You can use csv.DictReader to load the file into a list of dictionaries, then from that, find the maximum scheme value to build your output field names. For each row in the list, set the scheme N field to be equal to the value of the Tag column. Then we use csv.DictWriter to fill any missing values with NA where a key is not present, eg:

import csv

with open('input.csv', 'rb') as fin, open('output.csv', 'wb') as fout:
    rows = list(csv.DictReader(fin, skipinitialspace=True))
    schemes = range(1, max(int(row['scheme']) for row in rows) + 1)
    fieldnames = ['Ip'] + ['scheme {}'.format(i) for i in schemes] + ['Sentence_id']
    csvout = csv.DictWriter(fout, fieldnames=fieldnames, extrasaction='ignore', restval='NA')
    csvout.writeheader()
    for row in rows:
        row['scheme {}'.format(row['scheme'].strip())] = row['Tag']
        csvout.writerow(row)

This gives the following output given your example input:

Ip,scheme 1,scheme 2,scheme 3,Sentence_id
#1,yes,NA,NA,1
#1,NA,yes,NA,2
#2,no,NA,NA,1
#3,NA,NA,maybe,3

Upvotes: 1

Related Questions