Reputation: 21961
I have a dataframe that looks like this:
country region region_id year doy variable_a num_pixels
0 USA Iowa 12345 2022 1 32.2 100
1 USA Iowa 12345 2022 2 12.2 100
2 USA Iowa 12345 2022 3 22.2 100
3 USA Iowa 12345 2022 4 112.2 100
4 USA Iowa 12345 2022 5 52.2 100
The year in the dataframe above is 2022. I have more dataframes for other years starting from 2010 onwards.
I also have dataframes for other variables: variable_b, variable_c.
I want to combine all these dataframes into a single dataframe such that
country region region_id year doy variable_a variable_b variable_c
0 USA Iowa 12345 2010 1 32.2 44 101
1 USA Iowa 12345 2010 2 12.2 76 2332
..........................................................................
n-1 USA Iowa 12345 2022 1 321.2 444 501
n USA Iowa 12345 2022 2 122.2 756 32
What is the most efficient way to achieve this? Please note that there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values.
Upvotes: 2
Views: 1888
Reputation: 1216
I'm not sure people are hearing the later parts of your question:

"the data for the different variables is listed horizontally."

and

"there will be overlap in years in the other dataframes so the solution needs to take that into account and not leave NaN values."
I think I understand, and this is my solution.
We start by creating a baby dataset of two years, five days each, with two variables.
import pandas as pd
# Baseline dummy data
data = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
# 2022 data with "a" data
df_2022_a = pd.DataFrame(data)
df_2022_a["variable_a"] = range(5)
# 2022 data with "b" data
df_2022_b = pd.DataFrame(data)
df_2022_b["variable_b"] = range(5, 10)
# 2021 data with "a" data
df_2021_a = pd.DataFrame(data)
df_2021_a["variable_a"] = range(10, 15)
df_2021_a["year"] = 2021
# 2021 data with "b" data
df_2021_b = pd.DataFrame(data)
df_2021_b["variable_b"] = range(15, 20)
df_2021_b["year"] = 2021
frames = [df_2022_a, df_2022_b, df_2021_a, df_2021_b]
# Get the columns that they all share. This is what we'll group by.
# You can hard-code this if you want
common_cols = list(set.intersection(*(set(df.columns) for df in frames)))
# Yes, go ahead and concatenate them together... but there's one more step!
df = pd.concat(frames)
df
Here, you're left with a lot of duplicate days and a lot of NaNs. Collapse your dataframe by doing something like the following:
output_df = (
    df
    .groupby(by=common_cols)   # Only keep distinct values for the common cols
    .max()                     # Max will prefer non-NaN values over NaNs
    .reset_index()             # Collapse the multi-index
    .sort_values(common_cols)  # Sort by all these to get it nice and orderly
    .reset_index(drop=True)    # Tidy up the dataframe index
)
output_df
I believe this is the type of output that OP is asking for.
As for there being no NaNs in the final product, that will really depend on the data coverage across all variables for all years and days.
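For the baby dataset above, every (year, doy) pair is delivered in both an "a" frame and a "b" frame, so the groupby/max collapse does fill every cell. A quick sanity check, repeating the setup in condensed form with the same dummy values:

```python
import pandas as pd

# Condensed version of the baby dataset: two years, five days each,
# variable_a and variable_b arriving in separate frames.
base = {
    "country": ["USA"] * 5,
    "region": ["Iowa"] * 5,
    "region_id": [12345] * 5,
    "doy": [1, 2, 3, 4, 5],
    "num_pixels": [100] * 5,
}
frames = []
for year, (a_vals, b_vals) in {
    2022: (range(5), range(5, 10)),
    2021: (range(10, 15), range(15, 20)),
}.items():
    frames.append(pd.DataFrame({**base, "year": year, "variable_a": list(a_vals)}))
    frames.append(pd.DataFrame({**base, "year": year, "variable_b": list(b_vals)}))

common_cols = list(set.intersection(*(set(f.columns) for f in frames)))
out = (
    pd.concat(frames)
    .groupby(by=common_cols)
    .max()
    .reset_index()
    .sort_values(common_cols)
    .reset_index(drop=True)
)

# Every (year, doy) pair exists for both variables, so nothing is left NaN
print(out.isna().any().any())  # False
print(len(out))                # 10 rows: 2 years x 5 days
```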
Upvotes: 2
Reputation: 1770
res = pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Timing (average of 1000 runs):
0.003601184606552124
res = pd.concat(df_list)
res = res.sort_values(by="year").reset_index(drop=True)
Timing (average of 1000 runs):
0.002223911762237549
from itertools import chain
import numpy as np
import pandas as pd

def fast_flatten(input_list, df):
    # Flatten the column values into one list, then pad with NaN up to the
    # total row count (len(df.index) * 3 assumes three frames of equal length)
    r = list(chain.from_iterable(input_list))
    r += [np.nan] * (len(df.index) * 3 - len(r))
    return list(r)

def combine_lists(frames):
    # Union of the column names across all frames
    COLUMN_NAMES = [frames[i].columns for i in range(len(frames))]
    COL_NAMES = list(set(list(chain(*COLUMN_NAMES))))
    df_dict = dict.fromkeys(COL_NAMES, [])
    for col in COL_NAMES:
        extracted = (frame[col] for frame in frames if col in frame.columns.tolist())
        df_dict[col] = fast_flatten(extracted, frames[0])  # frames[0], not the global dfn[0]
    return pd.DataFrame.from_dict(df_dict)[COL_NAMES]
res = combine_lists(dfn)
res = res.sort_values(by = "year").reset_index(drop=True)
Timing (1000 runs):
0.0021250741481781007
Explanation of my code:
Here I used a trick. Instead of using pd.concat, I went for appending column lists. Especially for larger dataframes, use this appending method found on GitHub (I used a slightly modified version of the code). With frames of this size it beats pd.concat and wins in efficiency.
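As an aside (not part of the original benchmark), the same average-of-N-runs measurement can be done with the standard library's timeit, which handles the repetition for you. A minimal sketch, timing only the pd.concat variant on a dummy pair of frames:

```python
import timeit
import pandas as pd

# Two small dummy frames with one shared key column and distinct variables
df_list = [
    pd.DataFrame({"year": [2022] * 5, "doy": range(1, 6), "variable_a": range(5)}),
    pd.DataFrame({"year": [2021] * 5, "doy": range(1, 6), "variable_b": range(5)}),
]

def concat_approach():
    return pd.concat(df_list).sort_values(by="year").reset_index(drop=True)

# Average seconds per run over 1000 runs, matching the manual timing loop
avg = timeit.timeit(concat_approach, number=1000) / 1000
print(f"{avg:.6f} s per run")
```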
All tests use the same dataframe:
import pandas as pd
data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
All tests used this timing control:
import time

lst = []
for x in range(1000):
    start = time.time()
    # ... code under test ...
    end = time.time()
    lst.append(end - start)
print(sum(lst) / len(lst))
print(res)
Upvotes: 0
Reputation: 4827
IIUC, this should work for you:
import pandas as pd

data1 = {
    'country': {0: 'USA', 1: 'USA', 2: 'USA', 3: 'USA', 4: 'USA'},
    'region': {0: ' Iowa', 1: ' Iowa', 2: ' Iowa', 3: ' Iowa', 4: ' Iowa'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2022, 1: 2022, 2: 2022, 3: 2022, 4: 2022},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_a': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data2 = {
    'country': {0: 'USB', 1: 'USB', 2: 'USB', 3: 'USB', 4: 'USB'},
    'region': {0: ' Iowb', 1: ' Iowb', 2: ' Iowb', 3: ' Iowb', 4: ' Iowb'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2021, 1: 2021, 2: 2021, 3: 2021, 4: 2021},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_b': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
data3 = {
    'country': {0: 'USC', 1: 'USC', 2: 'USC', 3: 'USC', 4: 'USC'},
    'region': {0: ' Iowc', 1: ' Iowc', 2: ' Iowc', 3: ' Iowc', 4: ' Iowc'},
    'region_id': {0: 12345, 1: 12345, 2: 12345, 3: 12345, 4: 12345},
    'year': {0: 2020, 1: 2020, 2: 2020, 3: 2020, 4: 2020},
    'doy': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'variable_c1': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'variable_c2': {0: 32.2, 1: 12.2, 2: 22.2, 3: 112.2, 4: 52.2},
    'num_pixels': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100}
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
dfn = [df1, df2, df3]
pd.concat(dfn, axis=0).sort_values(['year', 'country', 'region']).reset_index(drop=True)
Output:
Upvotes: 2
Reputation: 527
Use the pd.concat method to do this efficiently. It stacks the dataframes vertically and creates new columns for all the new variables.
Here is an example of how pd.concat works, created with some dummy data.
CODE
import pandas as pd
df1 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2022, 2022, 2022], "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixles": [100, 100, 100]})
df2 = pd.DataFrame({"country": ["USA", "USA", "USA"], "region": ["Iowa", "Iowa", "Iowa"],
                    "region_id": [12345, 12345, 12345], "year": [2020, 2020, 2020], "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixles": [100, 100, 100]})
df_list = [df1, df2] # list of dataframes
res = pd.concat(df_list) # concat the list of dataframes
res = res.sort_values(by="year").reset_index(drop=True) # To make sure that the rows are sorted based on year
print(res)
OUTPUT
country region region_id year doy variable_a num_pixles variable_b
0 USA Iowa 12345 2020 1 NaN 100 54.2
1 USA Iowa 12345 2020 2 NaN 100 62.2
2 USA Iowa 12345 2020 3 NaN 100 2.2
3 USA Iowa 12345 2022 1 32.2 100 NaN
4 USA Iowa 12345 2022 2 12.2 100 NaN
5 USA Iowa 12345 2022 3 22.2 100 NaN
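Note that the NaNs above are genuine: the 2020 and 2022 rows cannot line up. When the key columns do overlap across frames, as in the OP's case, a merge on those columns places the variables side by side with no NaNs. A minimal sketch, reusing this answer's frames but with matching years (the values are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"country": ["USA"] * 3, "region": ["Iowa"] * 3,
                    "region_id": [12345] * 3, "year": [2022] * 3, "doy": [1, 2, 3],
                    "variable_a": [32.2, 12.2, 22.2], "num_pixels": [100] * 3})
# Same keys (country/region/region_id/year/doy) as df1, but carrying variable_b
df2 = pd.DataFrame({"country": ["USA"] * 3, "region": ["Iowa"] * 3,
                    "region_id": [12345] * 3, "year": [2022] * 3, "doy": [1, 2, 3],
                    "variable_b": [54.2, 62.2, 2.2], "num_pixels": [100] * 3})

keys = ["country", "region", "region_id", "year", "doy", "num_pixels"]
res = df1.merge(df2, on=keys, how="outer")
# variable_a and variable_b now sit in the same rows, with no NaNs
print(res)
```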
Upvotes: 1