james dunbar
james dunbar

Reputation: 1

convert crosstab to columns without using pandas in python

How do I convert the crosstab data from the input file mentioned below into columns based on the input list without using pandas?

Input list

[A,B,C]

Input data file

Labels A,B,C are only for representation, original file only has the numeric values. We can ignore the colums XX & YY based on the length of the input list

  A B C XX YY
A 0 2 3 4  8
B 4 0 6 4  8
C 7 8 0 5  8

Output (Output needs to have labels)

A A 0
A B 2
A C 3
B A 4
B B 0
B C 6
C A 7
C B 8
C C 0

The labels need to be present in the output file even though its present in the input file, hence I have mentioned its representation in the output file.

NB: In reality the labels are sorted city names without duplicates in ascending order & not single alphabets like A or B.

Unfortunately this would have been easier if I could install pandas on the server & use unstack(), but installations aren't allowed on this old server right now. This is on python 3.5

Upvotes: 0

Views: 642

Answers (1)

Grismar
Grismar

Reputation: 31329

Considering you tagged the post csv, I'm assuming the actual input data is a .csv file, without header as you indicated.

So example data would look like:

0,2,3,4,8
4,0,6,4,8
7,8,0,5,8

If the labels are provided as a list, matching the order of the columns and rows (i.e. ['A', 'B', 'C'] this would turn the example output into:

'A','A',0
'A','B',2
'A','C',3
'B','A',4
etc.

Note that this implies the number of rows and columns in the file cannot exceed the number of labels provided.

You indicate that the columns you label 'XX' and 'YY' are to be ignored, but you don't indicate how that's supposed to be communicated, but you do mention the length of the input is determining it, so I assume this means 'everything after column n can be ignored'.

This is a simple implementation:

from csv import reader


def unstack_csv(fn, columns, labels):
    with open(fn) as f:
        cr = reader(f)
        row = 0
        for line in cr:
            col = 0
            for x in line[:columns]:
                yield labels[row], labels[col], x
                col += 1
            row += 1


print(list(unstack_csv('unstack.csv', 3, ['A', 'B', 'C'])))

or if you like it short and sweet:

from csv import reader

with open('unstack.csv') as f:
    content = reader(f)
    labels = ['A', 'B', 'C']
    print([(labels[row], labels[col], x)
           for row, data in enumerate(content)
           for col, x in enumerate(data) if col < 3])

(I'm also assuming using numpy is out, for the same reason as pandas, but that stuff like csv is in, since it's a standard library)

If you don't want to provide the labels explicitly, but just want them generated, you could do something like:

def label(n):
    r = n // 26
    c = chr(65 + (n % 26))
    if r > 0:
        return label(r-1)+c
    else:
        return c

And then of course just remove the labels from the examples and replace with calls to label(col) and label(row).

Upvotes: 1

Related Questions