Reputation: 1227
Let's suppose I have data with the following structure: (year, country, region, values)
Example:
Year, Country, Region, Values
2010 A 1 [1,2,3,...(1000 values)]
2010 A 2 [1,2,3,...(1000 values)]
...
2014 J 5 [1,2,3,...(1000 values)]
There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.
I want to know how to decide whether I should use multiple rows or multiple columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?
There are many possible ways to store this data, for example:
Option 3 seems to be very bad, because there will be 5 years x 1000 columns. Option 4 also seems to be very bad, because I would need to group by every time I need something.
Upvotes: 0
Views: 531
Reputation: 186
You should look into "Tidy Data," which attempts to be a standard for organizing data values within a dataset.
Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.
Based on what you are saying, it seems like multiple columns might be the way to go, possibly split across several sets of data.
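To make the principles concrete, here is a minimal sketch of one possible tidy layout for the data in the question. It assumes the values are repeated measurements of a single variable, which the question does not confirm. The `ValueId` column, the random numbers, and the use of 3 values per group instead of 1000 are all illustrative assumptions.

```python
import itertools

import numpy as np
import pandas as pd

years = range(2010, 2015)        # 5 years, as in the question
countries = list("ABCDEFGHIJ")   # 10 countries
regions = range(1, 6)            # 5 regions per country
n_values = 3                     # stand-in for the 1000 values per group

rng = np.random.default_rng(0)   # made-up data, for illustration only

# Nested layout from the question: one row per group, a list in "Values".
nested = pd.DataFrame(
    [
        {"Year": y, "Country": c, "Region": r,
         "Values": rng.random(n_values).tolist()}
        for y, c, r in itertools.product(years, countries, regions)
    ]
)

# Tidy layout: one row per single value, one column per variable.
tidy = nested.explode("Values", ignore_index=True)
tidy = tidy.rename(columns={"Values": "Value"})
tidy["Value"] = tidy["Value"].astype(float)

# Hypothetical position of each value inside its original list.
tidy["ValueId"] = tidy.groupby(["Year", "Country", "Region"]).cumcount()

print(tidy.head())
```

If the 1000 values were instead 1000 different variables, the same principles would suggest one column per variable rather than one row per value.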
Upvotes: 2
Reputation: 150735
It depends on what you want to do, but I would go for multi-row, as I feel like pandas is built for handling columnar data. The long data format also seems to be preferred in general: a quick Google search on 'long' and 'wide' data yields many results on converting wide to long, but not the other way around.
This blog post also points out some of the advantages of long over wide data format.
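As a rough illustration of the wide-to-long conversion, here is a small sketch using `DataFrame.melt`. The toy frame, with one column per year, is an assumption made for the example and is not one of the asker's options.

```python
import pandas as pd

# Hypothetical wide layout: one column per year.
wide = pd.DataFrame(
    {
        "Country": ["A", "A", "B", "B"],
        "Region": [1, 2, 1, 2],
        "2010": [1.0, 2.0, 3.0, 4.0],
        "2011": [5.0, 6.0, 7.0, 8.0],
    }
)

# Wide to long: the year columns become values of a single "Year" column.
long = wide.melt(
    id_vars=["Country", "Region"], var_name="Year", value_name="Value"
)
long["Year"] = long["Year"].astype(int)

# In long form, a per-year aggregate is a single groupby.
mean_by_year = long.groupby("Year")["Value"].mean()

# Long back to wide, if a wide view is needed for display.
wide_again = long.set_index(["Country", "Region", "Year"])["Value"].unstack("Year")

print(long)
print(mean_by_year)
print(wide_again)
```

Going from long back to wide stays a one-liner, so choosing the long format for storage does not rule out a wide view later.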
Upvotes: 1