Should I use multi-rows or multi-columns with Pandas DataFrame?

Question

Let's suppose I have data with the following structure: (year, country, region, values)

Example:

Year, Country, Region, Values
2010     A      1      [1,2,3,...(1000 values)]
2010     A      2      [1,2,3,...(1000 values)]
...
2014     J      5      [1,2,3,...(1000 values)]

There are 5 years, 10 countries with 5 regions each and 1000 values for every combination of year, country, region.

I want to know how to decide whether I should use multi-rows or multi-columns to store this kind of data. What are the main differences, if any? What are the advantages of each approach?

There are many possible ways to store this data, for example:

1. Multi-row (country, region), single column (year) and an array of values
2. Multi-column (year, country, region) and a single value per row
3. Multi-row (country, region), multi-column (year, index of value)
4. Single-level rows, with one column for year, another for country, another for region and another for the array of values

Option 3 seems to be very bad, because there would be 5 years x 1000 = 5000 columns. Option 4 also seems to be very bad, because I would need a group-by every time I need something.

tzujan · Accepted Answer

You should look into "Tidy Data," which attempts to be a standard for organizing data values within a dataset.

Principles of Tidy Data
1. Columns represent separate variables
2. Rows represent individual observations
3. Observational units form individual DataFrames.

Based on what you are saying, it seems like multiple columns might be the way to go, possibly split across several DataFrames.
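As a rough illustration of what a tidy layout could look like for the data in the question, here is a minimal sketch. It assumes one row per individual value; the column names `sample` and `value` are made up for the example, and the random numbers are placeholders for the real 1000 values per (year, country, region).

```python
import numpy as np
import pandas as pd

years = range(2010, 2015)        # 5 years
countries = list("ABCDEFGHIJ")   # 10 countries
regions = range(1, 6)            # 5 regions each
n_values = 1000                  # values per (year, country, region)

rng = np.random.default_rng(0)   # placeholder data, not real measurements

# Every combination of the four identifying variables, one row per value.
index = pd.MultiIndex.from_product(
    [years, countries, regions, range(n_values)],
    names=["year", "country", "region", "sample"],  # "sample" is a hypothetical name
)
tidy = pd.DataFrame(
    {"value": rng.normal(size=len(index))}, index=index
).reset_index()

# Aggregations read directly off the variable columns.
means = tidy.groupby(["year", "country"])["value"].mean()

# A wide view (countries as rows, years as columns) can be derived when needed.
wide = means.unstack("year")
```

With this shape, a multi-row or multi-column view is something you derive on demand with `set_index`, `groupby` or `unstack`, rather than the format you store the data in.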
