deck
deck

Reputation: 880

How to Pandas read_csv multiple records per line

(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:

i   A           B           C       
    x   y   z   x   y   z   x   y   z
-------------------------------------
1   1   2   3   4   5   6   7   8   9
2   1   2   3   3   2   1   2   1   3
3   9   8   7   6   5   4   3   2   1

where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (part of a larger split-apply-combine step) is to read data with Pandas such that I have dimensionally homogenous observations like this:

i   id  GRP   x   y   z
-----------------------
1   1   A     1   2   3
2   1   B     4   5   6
3   1   C     7   8   9
4   2   A     1   2   3
5   2   B     3   2   1
6   2   C     2   1   3
7   3   A     9   8   7
8   3   B     6   5   4
9   3   C     3   2   1

So how best to accomplish this?

#1: I thought about reading the file using basic read_csv() options, then iterating/ slicing/transposing/whatever to create another dataframe that has the structure i want. But in my case the number of categories (A,B,C) and properties (x,y,z) is large and is not known ahead of time. I'm also worried about memory issues if scaling to large datasets.

#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line. (any reason y not set chunksize=1?) I wouldn't be creating multiple dataframes this way at least.

What's the smarter way to do this?

Upvotes: 0

Views: 408

Answers (1)

Zhenhao Chen
Zhenhao Chen

Reputation: 535

First I constructed the sample dataframe like yours:

column = pd.MultiIndex(levels=[['A', 'B', 'C'], ['x', 'y', 'z']],
                       labels=[[i for i in range(3) for _ in range(3)], [0, 1, 2]*3])

df = pd.DataFrame(np.random.randint(1,10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])

print df

#    A        B        C      
#    x  y  z  x  y  z  x  y  z
# 1  5  7  4  7  7  8  9  1  9
# 2  8  5  1  8  5  9  4  4  2
# 3  4  9  6  2  1  4  6  1  6

To get your desired output, reshape the dataframe using df.stack() and then reset the index:

df = df.stack(0).reset_index()

df.index += 1    # to make index begin from 1

print df

#    level_0 level_1  x  y  z
# 1        1       A  5  7  4
# 2        1       B  7  7  8
# 3        1       C  9  1  9
# 4        2       A  8  5  1
# 5        2       B  8  5  9
# 6        2       C  4  4  2
# 7        3       A  4  9  6
# 8        3       B  2  1  4
# 9        3       C  6  1  6

Then you can just rename the columns as you want. Hope it helps.

Upvotes: 1

Related Questions