klaus

Reputation: 1227

Should I use multi-rows or multi-columns with Pandas DataFrame?

Let's suppose I have data with the following structure: (year, country, region, values)

Example:

Year, Country, Region, Values
2010     A      1      [1,2,3,...(1000 values)]
2010     A      2      [1,2,3,...(1000 values)]
...
2014     J      5      [1,2,3,...(1000 values)]

There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.

I want to know how to decide whether I should use multi-rows or multi-columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?

There are many possible ways to store this data, for example:

  1. Multi-row (country, region), single column (year) and an array of values
  2. Multi-column (year, country, region) and a single value per row
  3. Multi-row (Country, region), multi column (Year, index of value)
  4. Single row and have one column for year, another for country, another for region and another for the array of values.

Option 3 seems to be very bad, because there will be 5 years x 1000 columns. Option 4 also seems to be very bad, because I would need to group by every time I need something.

Upvotes: 0

Views: 531

Answers (2)

tzujan

Reputation: 186

You should look into "Tidy Data," which attempts to be a standard for organizing data values within a dataset.

Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.

Based on what you are saying, it seems like multi-columns might be the way to go, possibly split across several sets of data.
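As a rough illustration of the tidy principles above, here is a minimal sketch that stores the question's data with one observation per row. The column names (`year`, `country`, `region`, `value_idx`, `value`) and the random placeholder values are assumptions made for the example, not something given in the question.

```python
import numpy as np
import pandas as pd

# Dimensions taken from the question: 5 years, 10 countries,
# 5 regions per country, 1000 values per combination.
years = range(2010, 2015)
countries = list("ABCDEFGHIJ")
regions = range(1, 6)
n_values = 1000

# One row per (year, country, region, value_idx) combination.
index = pd.MultiIndex.from_product(
    [years, countries, regions, range(n_values)],
    names=["year", "country", "region", "value_idx"],
)

# Placeholder data (assumption): random numbers stand in for the real values.
rng = np.random.default_rng(0)
tidy = pd.DataFrame({"value": rng.random(len(index))}, index=index).reset_index()

# Each variable is a plain column, so aggregating is a single groupby.
means = tidy.groupby(["year", "country", "region"])["value"].mean()
```

In this layout every variable is an ordinary column, so filtering and aggregating use the same few calls regardless of which dimension you slice on.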

Upvotes: 2

Quang Hoang

Reputation: 150735

It depends on what you want to do, but I would go for multi-row, as I feel like pandas is built for handling columnar data. The long data format also seems to be preferred in general: a quick Google search on 'long' and 'wide' data yields many results on wide-to-long conversion but not the other way around.

This blog post also points out some of the advantages of long over wide data format.
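To make the long versus wide distinction concrete, here is a small sketch of moving between the two with pandas. The tiny dataset and the column names are made up for illustration; only `set_index`, `unstack`, `reset_index`, and `melt` are standard pandas calls.

```python
import pandas as pd

# Long format: one row per observation (toy data, assumed for the example).
long_df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "country": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Multi-row layout: the identifying variables become a row MultiIndex.
by_rows = long_df.set_index(["year", "country"])["value"]

# Long to wide: move the "year" level from the rows into the columns.
wide = by_rows.unstack("year")

# Wide to long: melt the year columns back into a single variable.
back = wide.reset_index().melt(
    id_vars="country", var_name="year", value_name="value"
)
```

Because the conversion is cheap in both directions, one reasonable approach is to store the data long and reshape to wide only for the step that needs it.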

Upvotes: 1
