Let treatment of NA-values depend on their number relative to the number of available values among groups in a data frame, in R

Question

I have got a dataset containing contracts between states. The number of contracting states varies from 2 to 94. In another data frame, each state is attributed a value called “polity” - although for some, this value is missing.

With the help of this forum, I merged the two data frames, and then summarized the contracts by taking the difference of the min() and max() "polity"-values of the contracting states.

Now, I don't want to either ignore or exclude NA-values. I want to treat the polity value of a contract as NA if the number of NA-values among contracting states exceeds a certain fraction of the number of contracting states (for these data frames, it is most convenient to say that 4/5 of the polity-values must be available in order for the contract to be taken in the analysis).

These are two simplified versions of my data sets:

treaties <- data.frame(treaty.ID=c(1,1,2,2,3,3,3,4,4,4,4,4),
                   Treaty=c("hungary slovenia 1994", "hungary slovenia 1994",
                            "taiwan hungary 1994", "taiwan hungary 1994", 
                            "Treaty of Izmir 1977", "Treaty of Izmir 1977",
                            "Treaty of Izmir 1977", "Treaty of Five 1909", 
                            "Treaty of Five 1909", "Treaty of Five 1909",
                            "Treaty of Five 1909","Treaty of Five 1909"),
                   scode=c("HUN","SLV","TAW","HUN", "IRN", "TUR", "PAK", 
                           "AUS","AUL","NEW","USA","CAN"),
                   year=c(1994, 1994, 1994, 1994, 1977, 1977, 1977, 1909, 
                          1909, 1909, 1909, 1909),
                   pr.dem=c(1,1,0,0,0,0,0,1,1,1,1,1))

POL <- data.frame(country=c("Hungary", "Slovenia", "Taiwan","Austria",
                           "Australia", "New Zealand", "USA", "Canada",
                           "Iran","Turkey", "Pakistan"),
                 scode=c("HUN", "SLV", "TAW", "AUS", "AUL", "NEW", "USA",
                         "CAN", "IRN", "TUR", "PAK"),
                 year=c(1994, 1994, 1994, 1909, 1909, 1909, 1909, 1909,
                        1977, 1977, 1977),
                 polity = c(7, NA, 9, 8, 8, 10, 10, NA, -10, 9, NA))

(Hence, only treaties 1 and 3 should show NA for "polity" in the end)

I joined them together, reduced multiple rows with the same treaty to one while taking the difference of the maximum and minium of the polity values:

require(dplyr)
left_join(treaties, POL, c("scode","year")) %>%
                                group_by(Treaty) %>% 
                               summarise(PolityDiff=max(polity)-min(polity))

I would like to know if it is possible to let the treatment of NA-values depend on their number as opposed to the number of available values in a grouped data frame?

I tried to include an ifelse-function:

DIFF <- left_join(treaties, Polity, c("scode","year")) %>%
                       group_by(DIFF, File)

summarise(DIFF, polity.Diff=max(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE))-
            min(polity, na.rm = ifelse(length(polity = NA) >= 0.2*length(polity), TRUE, FALSE)))

but it returns the error:

Error: index out of bounds

Can I use the ifelse() function after “na.rm = ” ? Did I make a mistake? I would really appreciate your help.

talat · Accepted Answer

This should do what you want:

left_join(treaties, POL, c("scode","year")) %>%
  group_by(Treaty) %>%
  summarise(polity.Diff = max(polity, na.rm = sum(is.na(polity)) >= 0.2*n()) -
                          min(polity, na.rm = sum(is.na(polity)) >= 0.2*n()))
#Source: local data frame [4 x 2]
#
#                 Treaty polity.Diff
#1 hungary slovenia 1994           0
#2   taiwan hungary 1994           2
#3   Treaty of Five 1909           2
#4  Treaty of Izmir 1977          19

First of all, I use is.na() instead of length(XX = NA), secondly I use dplyr's special function n() instead of length(polity) and thirdly, I removed the ifelse and only left the logical test there - it will return TRUE or FALSE according to the specification. Note that in 3 of the cases, NA's will be removed and in one case (taiwan hungary 1994) they are not removed because there are not NAs at all in that group - that's why you end up without any NAs in polity.Diff column.

You'll probably notice that you do the same logical test for both max and min - which might be solved more efficiently by first creating a new variable, e.g. NAcheck, in your data and then just referring to that variable in the na.rm = definition. However, you'd also need to remove that variable afterwards (e.g. using select(-NAcheck)).

Let treatment of NA-values depend on their number relative to the number of available values among groups in a data frame, in R

Answers (1)

Related Questions