Reputation: 880
(I'm a pandas n00b) I have some oddly formatted CSV data that resembles this:
i    A          B          C
     x  y  z    x  y  z    x  y  z
----------------------------------
1    1  2  3    4  5  6    7  8  9
2    1  2  3    3  2  1    2  1  3
3    9  8  7    6  5  4    3  2  1
where A, B, C are categorical and the properties x, y, z are present for each. What I think I want to do (as part of a larger split-apply-combine step) is to read the data with pandas such that I have dimensionally homogeneous observations like this:
i    id   GRP   x  y  z
-----------------------
1    1    A     1  2  3
2    1    B     4  5  6
3    1    C     7  8  9
4    2    A     1  2  3
5    2    B     3  2  1
6    2    C     2  1  3
7    3    A     9  8  7
8    3    B     6  5  4
9    3    C     3  2  1
#1: I thought about reading the file with basic read_csv() options, then iterating/slicing/transposing to build another dataframe with the structure I want. But in my case the number of categories (A, B, C) and properties (x, y, z) is large and not known ahead of time, and I'm also worried about memory issues when scaling to large datasets.
#2: I like the idea of setting the iterator param in read_csv() and then yielding multiple observations per line (any reason not to set chunksize=1?). At least I wouldn't be creating multiple dataframes this way.
What's the smarter way to do this?
Upvotes: 0
Views: 408
Reputation: 535
First I constructed a sample dataframe like yours:
import numpy as np
import pandas as pd

# MultiIndex columns (A, x), (A, y), ..., (C, z); older pandas spells codes= as labels=
column = pd.MultiIndex(levels=[['A', 'B', 'C'], ['x', 'y', 'z']],
                       codes=[[i for i in range(3) for _ in range(3)], [0, 1, 2] * 3])
df = pd.DataFrame(np.random.randint(1, 10, size=(3, 9)),
                  columns=column, index=[1, 2, 3])
print(df)
#    A        B        C
#    x  y  z  x  y  z  x  y  z
# 1  5  7  4  7  7  8  9  1  9
# 2  8  5  1  8  5  9  4  4  2
# 3  4  9  6  2  1  4  6  1  6
To get your desired output, reshape the dataframe using df.stack() and then reset the index:
df = df.stack(0).reset_index()
df.index += 1 # to make index begin from 1
print(df)
#   level_0 level_1  x  y  z
# 1       1       A  5  7  4
# 2       1       B  7  7  8
# 3       1       C  9  1  9
# 4       2       A  8  5  1
# 5       2       B  8  5  9
# 6       2       C  4  4  2
# 7       3       A  4  9  6
# 8       3       B  2  1  4
# 9       3       C  6  1  6
Then you can just rename the columns as you want. Hope it helps.
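For example, to match the names in the question (a small sketch; 'id', 'GRP' and 'i' are just the names from your desired output above):

# level_0 holds the original row index, level_1 the category label
df = df.rename(columns={'level_0': 'id', 'level_1': 'GRP'})
df.index.name = 'i'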
Upvotes: 1