Reputation: 21961
I have a dataframe that looks like this:
country region region_id year doy variable_a num_pixels
0 USA Iowa 12345 2022 1 32.2 100
1 USA Iowa 12345 2022 2 12.2 100
2 USA Iowa 12345 2022 3 22.2 100
3 USA Iowa 12345 2022 4 112.2 100
4 USA Iowa 12345 2022 5 52.2 100
The year in the dataframe above is 2022. I have more dataframes for other years starting from 2010 onwards.
I also have dataframes for other variables: variable_b, variable_c.
I want to combine all these dataframes into a single dataframe such that
country region region_id year doy variable_a variable_b variable_c
0 USA Iowa 12345 2010 1 32.2 44 101
1 USA Iowa 12345 2010 2 12.2 76 2332
..........................................................................
n-1 USA Iowa 12345 2022 1 321.2 444 501
n USA Iowa 12345 2022 2 122.2 756 32
What is the most efficient way to achieve this? Please note that there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values.
Upvotes: 2
Views: 1888
Reputation: 1216
I'm not sure people are hearing the later parts of your question:

"the data for the different variables is listed horizontally."

and

"there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values."
I think I understand, and this is my solution.
We start by creating a baby dataset of two years, five days each, with two variables.
import pandas as pd
# Baseline dummy data
data = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
# 2022 data with "a" data
df_2022_a = pd.DataFrame(data)
df_2022_a["variable_a"] = range(5)
# 2022 data with "b" data
df_2022_b = pd.DataFrame(data)
df_2022_b["variable_b"] = range(5, 10)
# 2021 data with "a" data
df_2021_a = pd.DataFrame(data)
df_2021_a["variable_a"] = range(10, 15)
df_2021_a["year"] = 2021
# 2021 data with "b" data
df_2021_b = pd.DataFrame(data)
df_2021_b["variable_b"] = range(15, 20)
df_2021_b["year"] = 2021
frames = [df_2022_a, df_2022_b, df_2021_a, df_2021_b]
# Get the columns that they all share. This is what we'll group by.
# You can hard-code this if you want
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
# Yes, go ahead and concatenate them together... but there's one more step!
df = pd.concat(frames)
df
Here, you're left with a lot of duplicate days and a lot of NaNs. Collapse your dataframe by doing something like the following:
output_df = (
    df
    .groupby(by=common_cols)   # Only keep distinct values for the common cols
    .max()                     # Max will prefer non-NaN values over NaNs
    .reset_index()             # Collapse the multi-index
    .sort_values(common_cols)  # Sort by all these to get it nice and orderly
    .reset_index(drop=True)    # Tidy up the dataframe index
)
output_df
I believe this is the type of output that OP is asking for.
As for there being no NaNs in the final product, that will really depend on the data coverage across all variables for all years and days.
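For the baby dataset above, every (year, doy) pair is delivered in both an "a" frame and a "b" frame, so the groupby/max collapse does fill every cell. A quick sanity check, repeating the setup in condensed form with the same dummy values:

```python
import pandas as pd

# Condensed version of the baby dataset: two years, five days each,
# variable_a and variable_b arriving in separate frames.
base = {
    "country": ["USA"] * 5,
    "region": ["Iowa"] * 5,
    "region_id": [12345] * 5,
    "doy": [1, 2, 3, 4, 5],
    "num_pixels": [100] * 5,
}
frames = []
for year, (a_vals, b_vals) in {
    2022: (range(5), range(5, 10)),
    2021: (range(10, 15), range(15, 20)),
}.items():
    frames.append(pd.DataFrame({**base, "year": year, "variable_a": list(a_vals)}))
    frames.append(pd.DataFrame({**base, "year": year, "variable_b": list(b_vals)}))

common_cols = list(set.intersection(*(set(f.columns) for f in frames)))
out = (
    pd.concat(frames)
    .groupby(by=common_cols)
    .max()
    .reset_index()
    .sort_values(common_cols)
    .reset_index(drop=True)
)

# Every (year, doy) pair exists for both variables, so nothing is left NaN
print(out.isna().any().any())  # False
print(len(out))                # 10 rows: 2 years x 5 days
```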
Upvotes: 2
Reputation: 1770
res = pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Timing (average of 1000 runs):
0.003601184606552124
res = pd.concat(df_list)
res = res.sort_values(by="year").reset_index(drop=True)
Timing (average of 1000 runs):
0.002223911762237549
from itertools import chain
import numpy as np
import pandas as pd

def fast_flatten(input_list, df):
    # Flatten the column values into one list, then pad with NaN up to the
    # total row count (len(df.index) * 3 assumes three frames of equal length)
    r = list(chain.from_iterable(input_list))
    r += [np.nan] * (len(df.index) * 3 - len(r))
    return list(r)

def combine_lists(frames):
    # Union of the column names across all frames
    COLUMN_NAMES = [frames[i].columns for i in range(len(frames))]
    COL_NAMES = list(set(list(chain(*COLUMN_NAMES))))
    df_dict = dict.fromkeys(COL_NAMES, [])
    for col in COL_NAMES:
        extracted = (frame[col] for frame in frames if col in frame.columns.tolist())
        df_dict[col] = fast_flatten(extracted, frames[0])  # frames[0], not the global dfn[0]
    return pd.DataFrame.from_dict(df_dict)[COL_NAMES]
res = combine_lists(dfn)
res = res.sort_values(by = "year").reset_index(drop=True)
Timing (1000 runs):
0.0021250741481781007
Explanation of my code:
Here I used a trick. Instead of using pd.concat, I went for appending column lists. Especially for larger dataframes, use this appending method found on GitHub (I used a slightly modified version of the code). With frames of this size it beats pd.concat and wins in efficiency.
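As an aside (not part of the original benchmark), the same average-of-N-runs measurement can be done with the standard library's timeit, which handles the repetition for you. A minimal sketch, timing only the pd.concat variant on a dummy pair of frames:

```python
import timeit
import pandas as pd

# Two small dummy frames with one shared key column and distinct variables
df_list = [
    pd.DataFrame({"year": [2022] * 5, "doy": range(1, 6), "variable_a": range(5)}),
    pd.DataFrame({"year": [2021] * 5, "doy": range(1, 6), "variable_b": range(5)}),
]

def concat_approach():
    return pd.concat(df_list).sort_values(by="year").reset_index(drop=True)

# Average seconds per run over 1000 runs, matching the manual timing loop
avg = timeit.timeit(concat_approach, number=1000) / 1000
print(f"{avg:.6f} s per run")
```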
All tests use the same dataframe:
import pandas as pd
data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
All tests used this timing control:
import time

lst = []
for x in range(1000):
    start = time.time()
    # ... code under test ...
    end = time.time()
    lst.append(end - start)
print(sum(lst) / len(lst))
print(res)
Upvotes: 0
Reputation: 4827
IIUC, this should work for you:
import pandas as pd

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Output:
Upvotes: 2
Reputation: 527
Use the pd.concat method to do this efficiently. It stacks the dataframes vertically and creates new columns for all the new variables.
Here is an example of how pd.concat works, created with some dummy data.
CODE
import pandas as pd
df1 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2022, 2022, 2022], "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixles": [100, 100, 100]})
df2 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2020, 2020, 2020], "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixles": [100, 100, 100]})
df_list = [df1, df2] # list of dataframes
res = pd.concat(df_list) # concat the list of dataframes
res = res.sort_values(by="year").reset_index(drop=True) # To make sure that the rows are sorted based on year
print(res)
OUTPUT
country region region_id year doy variable_a num_pixles variable_b
0 USA Iowa 12345 2020 1 NaN 100 54.2
1 USA Iowa 12345 2020 2 NaN 100 62.2
2 USA Iowa 12345 2020 3 NaN 100 2.2
3 USA Iowa 12345 2022 1 32.2 100 NaN
4 USA Iowa 12345 2022 2 12.2 100 NaN
5 USA Iowa 12345 2022 3 22.2 100 NaN
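Note that the NaNs above are genuine: the 2020 and 2022 rows cannot line up. When the key columns do overlap across frames, as in the OP's case, a merge on those columns places the variables side by side with no NaNs. A minimal sketch, reusing this answer's frames but with matching years (the values are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"country": ["USA"] * 3, "region": ["Iowa"] * 3,
                    "region_id": [12345] * 3, "year": [2022] * 3, "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixels": [100] * 3})
# Same keys (country/region/region_id/year/doy) as df1, but carrying variable_b
df2 = pd.DataFrame({"country": ["USA"] * 3, "region": ["Iowa"] * 3,
                    "region_id": [12345] * 3, "year": [2022] * 3, "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixels": [100] * 3})

keys = ["country", "region", "region_id", "year", "doy", "num_pixels"]
res = df1.merge(df2, on=keys, how="outer")
# variable_a and variable_b now sit in the same rows, with no NaNs
print(res)
```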
Upvotes: 1