Reputation: 91
I have a dataset that shows bilateral exports for several countries. Because the data fluctuates, I need to calculate the mean of year groups. All the countries do not cover exactly the years. Some start later, some have gaps in between - this means, some years are missing (but without having NA entries). I already managed to cut the data into pieces whith the help of an amazing community member: year_group.
Below I am listing two further problems along with my code, the wrong output and on the bottom some sample input data for the dataset total_trade
Problem 1
I am facing the issue, that the code does not calculate the right means. When I calculate the results manually, I get different results than my code. (see below)
This is my code
# create vectors for coding 4 years average
year_group_break <- c(1999, 2003, 2007, 2011, 2015, 2019)
year_group_labels <- c("1999-2002", "2003-2006", "2007-2010", "2011-2014", "2015-2018")
years <- c(1999, 2000, 2001, 2002,2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)
FourY_av <- total_trade %>%
# create year_group variable for average values with above predefined labels and cuts,
# chose right = FALSE to take cut before year_group_break
mutate(year_group = cut(Year, breaks = year_group_break,
labels = year_group_labels,
include.lowest = TRUE, right = FALSE)) %>%
# add column with mean of total trade per four year period: "avg_year_group_total"
group_by(ReporterName, year_group) %>%
mutate(total_year_group = mean(Total_Year)) %>%
arrange(ReporterName,PartnerName, desc(Year))
View(FourY_av)
Below is the wrong output This output is wrong because total_year_group (the mean of the year group '2015-2018' for Angola) should be 34746013.5 (when calculated manually) instead of 34907582. (as in the output) Where is my mistake?
> head(FourY_av)
# A tibble: 6 x 9
# Groups: ReporterName, year_group [1]
Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group total_year_group
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
1 2018 Angola Afghanistan AFG 19.4 42096736. 0.0000460 2015-2018 34907582.
2 2017 Angola Afghanistan AFG 2.25 34904881. 0.00000644 2015-2018 34907582.
3 2016 Angola Afghanistan AFG 0.775 28057500. 0.00000276 2015-2018 34907582.
4 2015 Angola Afghanistan AFG 39.6 33924937. 0.000117 2015-2018 34907582.
5 2018 Angola Albania ALB 2.38 42096736. 0.00000565 2015-2018 34907582.
6 2017 Angola Albania ALB 39.7 34904881. 0.000114 2015-2018 34907582.
Problem 2
Another issue is that not all countries display data for all years. SOme start later, some have gaps. I still need the means for the same year groups in order to ensure comparability. The data set has no NA's. The data is just missing.
E.g. Angola does not cover the year 2008. The data set does not include NA's but does not include the row and value for 2008 for Angola. Other countries are displaying data for 2008. I still need a mean for the available years for Angola in the column total_year_group (by taking the mean for the years 2007, 2009 & 2010). This shouldn't be an issue with the mean function, right? Or do I need to consider something particular in this case?
Here is some sample input data of total_trade
dput(head(total_trade, n = 100))
structure(list(Year = c(2015L, 2018L, 2017L, 2016L, 2017L, 2015L,
2018L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L, 2018L, 2018L,
2015L, 2016L, 2017L, 2016L, 2015L, 2017L, 2018L, 2018L, 2017L,
2016L, 2015L, 2018L, 2014L, 2015L, 2016L, 2017L, 2017L, 2018L,
2016L, 2015L, 2016L, 2018L, 2017L, 2015L, 2010L, 2009L, 2016L,
2013L, 2014L, 2018L, 2017L, 2015L, 2016L, 2017L, 2018L, 2017L,
2018L, 2016L, 2016L, 2018L, 2007L, 2013L, 2009L, 2018L, 2015L,
2016L, 2014L, 2010L, 2017L, 2012L, 2011L, 2018L, 2016L, 2015L,
2016L, 2011L, 2018L, 2017L, 2015L, 2015L, 2016L, 2018L, 2017L,
2015L, 2015L, 2016L, 2018L, 2017L, 2007L, 2014L, 2010L, 2013L,
2011L, 2009L, 2012L, 2017L, 2018L, 2016L, 2015L, 2015L, 2015L,
2017L, 2016L, 2018L, 2015L), ReporterName = c("Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola"
), PartnerName = c("Afghanistan", "Afghanistan", "Afghanistan",
"Afghanistan", "Albania", "Albania", "Albania", "Algeria", "Algeria",
"Algeria", "Algeria", "American Samoa", "Andorra", "Andorra",
"Antigua and Barbuda", "Antigua and Barbuda", "Antigua and Barbuda",
"Antigua and Barbuda", "Argentina", "Argentina", "Argentina",
"Argentina", "Armenia", "Armenia", "Armenia", "Armenia", "Australia",
"Australia", "Australia", "Australia", "Australia", "Austria",
"Austria", "Austria", "Austria", "Azerbaijan", "Azerbaijan",
"Azerbaijan", "Azerbaijan", "Bahamas, The", "Bahamas, The", "Bahamas, The",
"Bahamas, The", "Bahamas, The", "Bahamas, The", "Bahamas, The",
"Bahamas, The", "Bahrain", "Bahrain", "Bahrain", "Bangladesh",
"Bangladesh", "Bangladesh", "Barbados", "Belarus", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belgium", "Belgium",
"Belgium", "Belgium", "Belgium", "Belgium", "Belize", "Belize",
"Belize", "Benin", "Benin", "Benin", "Benin", "Benin", "Bhutan",
"Bolivia", "Bolivia", "Bolivia", "Bolivia", "Botswana", "Botswana",
"Botswana", "Botswana", "Brazil", "Brazil", "Brazil", "Brazil",
"Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil",
"British Virgin Islands", "Brunei", "Bulgaria", "Bulgaria", "Bulgaria",
"Bulgaria"), PartnerISO3 = c("AFG", "AFG", "AFG", "AFG", "ALB",
"ALB", "ALB", "DZA", "DZA", "DZA", "DZA", "ASM", "AND", "AND",
"ATG", "ATG", "ATG", "ATG", "ARG", "ARG", "ARG", "ARG", "ARM",
"ARM", "ARM", "ARM", "AUS", "AUS", "AUS", "AUS", "AUS", "AUT",
"AUT", "AUT", "AUT", "AZE", "AZE", "AZE", "AZE", "BHS", "BHS",
"BHS", "BHS", "BHS", "BHS", "BHS", "BHS", "BHR", "BHR", "BHR",
"BGD", "BGD", "BGD", "BRB", "BLR", "BEL", "BEL", "BEL", "BEL",
"BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BEL", "BLZ", "BLZ",
"BLZ", "BEN", "BEN", "BEN", "BEN", "BEN", "BTN", "BOL", "BOL",
"BOL", "BOL", "BWA", "BWA", "BWA", "BWA", "BRA", "BRA", "BRA",
"BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "BRA", "VGB",
"BRN", "BGR", "BGR", "BGR", "BGR"), `TradeValue in 1000 USD` = c(39.586,
19.353, 2.248, 0.775, 39.723, 2.259, 2.38, 2169.123, 2322.463,
2241.599, 245.226, 12.007, 5.975, 0.326, 422.006, 155.467, 47.018,
54.774, 483.147, 142.23, 98.7, 61.362, 60.105, 30.494, 0.99,
0.731, 40220.092, 45435.804, 16096.404, 8546.882, 1904.301, 627.179,
433.699, 23.118, 5.124, 985.67, 600.371, 143.356, 9.926, 140139.415,
108214.936, 64444.203, 100210.999, 52974.059, 7322.893, 145.791,
26.995, 4.847, 5.187, 1.958, 125.722, 55.22, 2.75, 3.366, 54.31,
107976.895, 123610.469, 66757.2, 67763.201, 50046.64, 40199.706,
52383.95, 45614.873, 28690.458, 52907.343, 39328.574, 452.078,
5.82, 0.32, 970.324, 1700.981, 804.478, 332.216, 69.342, 1.632,
1530.58, 308.752, 62.569, 19.822, 55.241, 37.029, 16.917, 0.198,
874217.786, 1032751.313, 509259.955, 428750.075, 333280.441,
192964.08, 315316.932, 119947.132, 141486.749, 66556.728, 1273.093,
5.064, 22.324, 158.252, 33.583, 8.435, 0.077), Total_Year = c(33924937.48,
42096736.31, 34904881.111, 28057499.527, 34904881.111, 33924937.48,
42096736.31, 28057499.527, 33924937.48, 34904881.111, 42096736.31,
42096736.31, 34904881.111, 42096736.31, 42096736.31, 33924937.48,
28057499.527, 34904881.111, 28057499.527, 33924937.48, 34904881.111,
42096736.31, 42096736.31, 34904881.111, 28057499.527, 33924937.48,
42096736.31, 58672369.19, 33924937.48, 28057499.527, 34904881.111,
34904881.111, 42096736.31, 28057499.527, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 33924937.48, 52612114.76, 40639411.73,
28057499.527, 67712526.544, 58672369.19, 42096736.31, 34904881.111,
33924937.48, 28057499.527, 34904881.111, 42096736.31, 34904881.111,
42096736.31, 28057499.527, 28057499.527, 42096736.31, 44177783.072,
67712526.544, 40639411.73, 42096736.31, 33924937.48, 28057499.527,
58672369.19, 52612114.76, 34904881.111, 70863076.416, 66427390.221,
42096736.31, 28057499.527, 33924937.48, 28057499.527, 66427390.221,
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 33924937.48, 33924937.48, 28057499.527,
42096736.31, 34904881.111, 44177783.072, 58672369.19, 52612114.76,
67712526.544, 66427390.221, 40639411.73, 70863076.416, 34904881.111,
42096736.31, 28057499.527, 33924937.48, 33924937.48, 33924937.48,
34904881.111, 28057499.527, 42096736.31, 33924937.48), pct_by_partner_year = c(0.000116687024179005,
4.59726850497024e-05, 6.44035999679013e-06, 2.7621848456389e-06,
0.000113803567683494, 6.65881846158674e-06, 5.65364493454718e-06,
0.0077309918437765, 0.00684588733986371, 0.00642202158738646,
0.000582529719629944, 2.8522401146684e-05, 1.71179497245645e-05,
7.74406827169068e-07, 0.00100246726228929, 0.000458267609458834,
0.000167577299448064, 0.000156923611416451, 0.00172198880208503,
0.000419249114560196, 0.000282768474948037, 0.00014576426910659,
0.000142778289407966, 8.73631395649993e-05, 3.5284683834613e-06,
2.15475710288619e-06, 0.0955420669759755, 0.0774398658640565,
0.0474471147057807, 0.0304620231456308, 0.00545568682484317,
0.00179682319502973, 0.00103024376238159, 8.23950829180388e-05,
1.51039335091503e-05, 0.00351303578942051, 0.00142616994243657,
0.000410704736521284, 2.92587127267419e-05, 0.2663633948935,
0.266280763902189, 0.229686194730164, 0.147994771595005, 0.0902879153020956,
0.0173953936620509, 0.000417680838208199, 7.95727332317548e-05,
1.72752386410474e-05, 1.48603858110989e-05, 4.65119192514428e-06,
0.000360184581635431, 0.000131174064405754, 9.80130106517029e-06,
1.19967925037684e-05, 0.000129012376636663, 0.244414471464133,
0.18255184868885, 0.164267141570654, 0.160970200874938, 0.147521686751831,
0.143276153177212, 0.0892821454514031, 0.0867003221749986, 0.0821961201035531,
0.0746613690455784, 0.0592053577133712, 0.00107390272887404,
2.07431171633786e-05, 9.43258923288073e-07, 0.00345834096536738,
0.0025606620918584, 0.00191102225615742, 0.000951775194258733,
0.000204398313308255, 4.81062050876917e-06, 0.00545515468521031,
0.000733434529761055, 0.000179255731601051, 5.84289949294256e-05,
0.000162833019316739, 0.000131975409869888, 4.01860131755188e-05,
5.6725590719059e-07, 1.97886296054109, 1.76020046106476, 0.967951882039117,
0.633191666126054, 0.50172141324715, 0.474820062066878, 0.444966473299775,
0.34363999584631, 0.336099093188823, 0.237215465105691, 0.00375267603882995,
1.49270724610338e-05, 6.58041006358842e-05, 0.000453380716286491,
0.00011969348860786, 2.00371827827334e-05, 2.26971678416193e-07
)), row.names = c(NA, -100L), groups = structure(list(Year = c(2007L,
2007L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 2011L, 2011L,
2011L, 2012L, 2012L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L,
2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2016L,
2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L
), ReporterName = c("Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola", "Angola", "Angola", "Angola",
"Angola", "Angola", "Angola", "Angola"), PartnerName = c("Belgium",
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Bahamas, The",
"Belgium", "Brazil", "Belgium", "Benin", "Brazil", "Belgium",
"Brazil", "Bahamas, The", "Belgium", "Brazil", "Australia", "Bahamas, The",
"Belgium", "Brazil", "Afghanistan", "Albania", "Algeria", "Antigua and Barbuda",
"Argentina", "Armenia", "Australia", "Austria", "Azerbaijan",
"Bahamas, The", "Belgium", "Belize", "Benin", "Bhutan", "Bolivia",
"Botswana", "Brazil", "British Virgin Islands", "Brunei", "Bulgaria",
"Afghanistan", "Algeria", "Antigua and Barbuda", "Argentina",
"Armenia", "Australia", "Austria", "Azerbaijan", "Bahamas, The",
"Bahrain", "Bangladesh", "Barbados", "Belgium", "Belize", "Benin",
"Bolivia", "Botswana", "Brazil", "Bulgaria", "Afghanistan", "Albania",
"Algeria", "Andorra", "Antigua and Barbuda", "Argentina", "Armenia",
"Australia", "Austria", "Azerbaijan", "Bahamas, The", "Bahrain",
"Bangladesh", "Belgium", "Benin", "Bolivia", "Botswana", "Brazil",
"Bulgaria", "Afghanistan", "Albania", "Algeria", "American Samoa",
"Andorra", "Antigua and Barbuda", "Argentina", "Armenia", "Australia",
"Austria", "Azerbaijan", "Bahamas, The", "Bahrain", "Bangladesh",
"Belarus", "Belgium", "Belize", "Benin", "Bolivia", "Botswana",
"Brazil", "Bulgaria"), .rows = structure(list(56L, 84L, 41L,
58L, 89L, 40L, 63L, 86L, 66L, 71L, 88L, 65L, 90L, 43L, 57L,
87L, 28L, 44L, 62L, 85L, 1L, 6L, 9L, 16L, 20L, 26L, 29L,
35L, 39L, 47L, 60L, 69L, 74L, 75L, 79L, 80L, 94L, 95L, 96L,
100L, 4L, 8L, 17L, 19L, 25L, 30L, 34L, 36L, 42L, 48L, 53L,
54L, 61L, 68L, 70L, 76L, 81L, 93L, 98L, 3L, 5L, 10L, 13L,
18L, 21L, 24L, 31L, 32L, 38L, 46L, 49L, 51L, 64L, 73L, 78L,
83L, 91L, 97L, 2L, 7L, 11L, 12L, 14L, 15L, 22L, 23L, 27L,
33L, 37L, 45L, 50L, 52L, 55L, 59L, 67L, 72L, 77L, 82L, 92L,
99L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), row.names = c(NA, 100L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"))
Upvotes: 0
Views: 71
Reputation: 26218
The issue with mean
is duplicated rows for any ReporterName
in the data.
Problem-1
total_trade %>%
# create year_group variable for average values with above predefined labels and cuts,
# chose right = FALSE to take cut before year_group_break
mutate(year_group = cut(Year, breaks = year_group_break,
labels = year_group_labels,
include.lowest = TRUE, right = FALSE)) %>%
# add column with mean of total trade per four year period: "avg_year_group_total"
group_by(ReporterName, year_group) %>%
mutate(dup = !duplicated(paste0(ReporterName, year_group, Total_Year)),
total_year_group = sum(Total_Year * dup)/sum(dup)) %>%
arrange(ReporterName,PartnerName, desc(Year))
# A tibble: 100 x 10
# Groups: ReporterName, year_group [3]
Year ReporterName PartnerName PartnerISO3 `TradeValue in 1000 USD` Total_Year pct_by_partner_year year_group dup total_year_group
<int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct> <lgl> <dbl>
1 2018 Angola Afghanistan AFG 19.4 42096736. 0.0000460 2015-2018 TRUE 34746014.
2 2017 Angola Afghanistan AFG 2.25 34904881. 0.00000644 2015-2018 TRUE 34746014.
3 2016 Angola Afghanistan AFG 0.775 28057500. 0.00000276 2015-2018 TRUE 34746014.
4 2015 Angola Afghanistan AFG 39.6 33924937. 0.000117 2015-2018 TRUE 34746014.
5 2018 Angola Albania ALB 2.38 42096736. 0.00000565 2015-2018 FALSE 34746014.
6 2017 Angola Albania ALB 39.7 34904881. 0.000114 2015-2018 FALSE 34746014.
7 2015 Angola Albania ALB 2.26 33924937. 0.00000666 2015-2018 FALSE 34746014.
8 2018 Angola Algeria DZA 245. 42096736. 0.000583 2015-2018 FALSE 34746014.
9 2017 Angola Algeria DZA 2242. 34904881. 0.00642 2015-2018 FALSE 34746014.
10 2016 Angola Algeria DZA 2169. 28057500. 0.00773 2015-2018 FALSE 34746014.
# ... with 90 more rows
Problem-2
Use complete
from tidyr
. If you can show a desired output, I may show you how.
Upvotes: 1