I posted a similar question a week ago, but I had failed to identify the real problem, so the question was poorly framed.
Now I know exactly what is going on, but I cannot understand why it is happening. I also reviewed similar questions about the same warning, but their solutions did not apply to my case.
I am plotting how the frequency distribution of a variable evolves over the fieldwork period of a survey, that is, how the proportion of a category of that variable changes over time.
I have a variable (startday) that records the day on which the respondent took the survey; it is NA if he/she did not take it. I also have the typical variables, such as sex or marital status.
This is the code to plot the graph:
df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  mutate(Rs = cumsum(sf_sex %in% c("Male", "Female")),
         female_Rs = cumsum(sf_sex == "Female")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, female_Rs) %>%
  mutate(female_prop = female_Rs/Rs) %>%
  ggplot(aes(x = date, y = female_prop)) +
  geom_point() +
  geom_line()
And this is what I get.
That is exactly what I want. The problem appears when I use marital status as the variable (it has the same nature as the other one: a two-category character variable). This is what I get using the following code:
df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  mutate(Rs = cumsum(Maritaldummy %in% c("Not married", "Married")),
         Married_Rs = cumsum(Maritaldummy == "Married")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, Married_Rs) %>%
  mutate(Married_prop = Married_Rs/Rs) %>%
  ggplot(aes(x = date, y = Married_prop)) +
  geom_point() +
  geom_line()
Warning messages:
1: Removed 34 rows containing missing values (geom_point).
2: Removed 34 row(s) containing missing values (geom_path).
As you can see, the observations stop around the 5th of June.
Things to consider:
The strange part is that this code works for experimental groups 2 and 3 (n = 350 each) but not for experimental group 1 (n = 2050). I believe the problem comes from there, because when I draw a random sample of fewer than 1300 observations from group 1, it works! This is an example of the same code for group 2.
I am providing a reproducible example, but I am afraid the problem only appears with the full sample. Maybe you can still spot what is wrong?
Thanks a lot for your attention, time and help.
df <- structure(list(startday = c("06/02/2019", "05/22/2019", "05/28/2019",
"05/26/2019", "06/03/2019", "06/10/2019", "05/22/2019", "05/30/2019",
"05/31/2019", "06/18/2019", "05/22/2019", "05/25/2019", "05/25/2019",
"05/22/2019", "06/14/2019", "06/14/2019", "05/20/2019", "05/27/2019",
"05/20/2019", "05/21/2019", "05/20/2019", "05/20/2019", "06/09/2019",
"06/12/2019", "05/24/2019", "05/20/2019", "05/20/2019", "05/28/2019",
"06/09/2019", "05/20/2019", "06/21/2019", "06/03/2019", "06/07/2019",
"05/26/2019", "05/28/2019", "06/03/2019", "06/06/2019", "06/05/2019",
"05/27/2019", "06/10/2019", "05/20/2019", "06/05/2019", "05/20/2019",
"06/04/2019", "05/23/2019", "05/20/2019", "06/11/2019", "05/28/2019",
"06/09/2019", "06/15/2019", "05/25/2019", "06/14/2019", "05/20/2019",
"06/05/2019", "06/04/2019", "06/10/2019", "06/16/2019", "06/05/2019",
"06/29/2019", "05/30/2019", "06/03/2019", "06/09/2019", "05/20/2019",
"05/25/2019", "06/16/2019", "06/14/2019", "05/21/2019", "05/28/2019",
"06/09/2019", "06/07/2019", "05/25/2019", "05/20/2019", "05/27/2019",
"05/20/2019", "05/21/2019", "05/20/2019", "06/17/2019", "06/26/2019",
"06/07/2019", "05/22/2019", "06/19/2019", "06/04/2019", "05/21/2019",
"05/21/2019", "05/21/2019", "06/14/2019", "05/25/2019", "06/19/2019",
"05/20/2019", "06/03/2019", "05/20/2019", "06/04/2019", "05/20/2019",
"05/27/2019", "05/22/2019", "05/20/2019", "06/02/2019", "05/21/2019",
"05/23/2019", "06/03/2019", "06/14/2019", "06/14/2019", "06/07/2019",
"05/20/2019", "05/23/2019", "06/24/2019", "06/03/2019", "05/20/2019",
"06/06/2019", "06/15/2019", "06/06/2019", "05/27/2019", "05/24/2019",
"05/22/2019", "05/20/2019", "05/30/2019", "06/23/2019", "05/21/2019",
"05/20/2019", "06/16/2019", "05/20/2019", "05/24/2019", "05/21/2019",
"05/21/2019", "06/20/2019", "05/20/2019", "05/22/2019", "06/06/2019",
"05/20/2019", "05/21/2019", "06/15/2019", "05/27/2019", "05/26/2019",
"06/06/2019", "05/20/2019", "06/05/2019", "06/02/2019", "06/20/2019",
"05/22/2019", "05/20/2019", "06/03/2019", "05/20/2019", "06/03/2019",
"05/20/2019", "06/03/2019", "05/22/2019", "05/20/2019", "05/22/2019",
"05/22/2019", "05/20/2019", "05/20/2019", "05/23/2019", "05/23/2019",
"05/23/2019", "06/05/2019", "06/08/2019", "06/03/2019", "05/24/2019",
"06/05/2019", "06/02/2019", "05/20/2019", "05/29/2019", "06/04/2019",
"05/21/2019", "06/08/2019", "06/12/2019", "05/30/2019", "06/05/2019",
"06/12/2019", "05/20/2019", "05/20/2019", "06/26/2019", "05/20/2019",
"06/04/2019", "05/20/2019", "06/06/2019", "05/24/2019", "05/24/2019",
"06/06/2019", "06/22/2019", "05/26/2019", "05/29/2019", "05/27/2019",
"05/20/2019", "05/23/2019", "05/21/2019", "05/22/2019", "05/22/2019",
"06/11/2019", "06/05/2019", "06/05/2019", "05/28/2019", "05/23/2019",
"06/13/2019", "05/20/2019", "06/07/2019", "05/28/2019", "06/12/2019",
"06/28/2019", "06/15/2019"), sf_sex = c("Female", "Male", "Male",
"Male", "Male", "Female", "Female", "Female", "Female", "Female",
"Female", "Male", "Female", "Male", "Female", "Female", "Female",
"Male", "Female", "Female", "Male", "Male", "Female", "Male",
"Male", "Female", "Male", "Female", "Female", "Male", "Male",
"Male", "Female", "Female", "Male", "Male", "Female", "Male",
"Female", "Male", "Female", "Female", "Female", "Male", "Male",
"Female", "Male", "Male", "Male", "Female", "Male", "Female",
"Male", "Male", "Male", "Female", "Female", "Female", "Female",
"Male", "Female", "Male", "Male", "Female", "Female", "Male",
"Male", "Male", "Male", "Female", "Male", "Male", "Female", "Female",
"Male", "Male", "Male", "Male", "Female", "Female", "Male", "Male",
"Female", "Male", "Male", "Male", "Female", "Female", "Female",
"Female", "Male", "Female", "Female", "Female", "Male", "Female",
"Female", "Female", "Male", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Male", "Female", "Male", "Male",
"Female", "Male", "Female", "Female", "Male", "Female", "Male",
"Male", "Female", "Female", "Female", "Male", "Female", "Female",
"Male", "Female", "Male", "Female", "Female", "Male", "Female",
"Female", "Male", "Female", "Male", "Male", "Female", "Female",
"Female", "Female", "Female", "Male", "Female", "Female", "Female",
"Female", "Female", "Male", "Female", "Male", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Male", "Male",
"Male", "Female", "Female", "Female", "Female", "Female", "Male",
"Male", "Female", "Female", "Female", "Male", "Female", "Male",
"Female", "Male", "Female", "Female", "Male", "Female", "Male",
"Male", "Female", "Male", "Female", "Female", "Male", "Female",
"Female", "Male", "Female", "Female", "Female", "Male", "Male",
"Male", "Female", "Female", "Female", "Female", "Male"), Maritaldummy = c("Not married",
"Married", "Married", "Not married", "Not married", "Married",
"Married", "Married", "Not married", "Not married", "Not married",
"Married", "Married", "Married", "Married", "Married", "Not married",
"Not married", "Not married", "Married", "Not married", "Not married",
"Not married", "Not married", "Not married", "Married", "Married",
"Not married", "Married", "Not married", "Married", "Not married",
"Not married", "Not married", "Not married", "Not married", "Married",
"Not married", "Married", "Married", "Not married", "Not married",
"Married", "Not married", "Married", "Not married", "Not married",
"Not married", "Married", "Married", "Married", "Not married",
"Not married", "Married", "Married", "Not married", "Not married",
"Married", "Married", "Not married", "Married", "Married", "Married",
"Not married", "Married", "Not married", "Not married", "Married",
"Not married", "Married", "Not married", "Not married", "Not married",
"Married", "Not married", "Not married", "Married", "Married",
"Not married", "Married", "Married", "Married", "Married", "Married",
"Married", "Not married", "Married", "Not married", "Not married",
"Not married", "Not married", "Not married", "Married", "Not married",
"Married", "Married", "Not married", "Not married", "Married",
"Not married", "Married", "Married", "Married", "Married", "Not married",
"Married", "Married", "Married", "Not married", "Married", "Not married",
"Not married", "Married", "Not married", "Married", "Not married",
"Not married", "Married", "Not married", "Married", "Not married",
"Married", "Married", "Not married", "Married", "Married", "Married",
"Not married", "Married", "Married", "Married", "Married", "Married",
"Married", "Married", "Married", "Not married", "Not married",
"Not married", "Married", "Married", "Married", "Not married",
"Married", "Not married", "Married", "Not married", "Married",
"Married", "Married", "Married", "Married", "Not married", "Married",
"Not married", "Not married", "Married", "Married", "Married",
"Married", "Married", "Married", "Married", "Married", "Not married",
"Married", "Married", "Married", "Not married", "Not married",
"Married", "Not married", "Married", "Not married", "Married",
"Married", "Not married", "Not married", "Married", "Not married",
"Married", "Not married", "Not married", "Married", "Not married",
"Not married", "Married", "Married", "Married", "Not married",
"Not married", "Not married", "Married", "Married", "Married",
"Married", "Not married", "Not married", "Married", "Not married")), row.names = c("3564", "2999", "20144", "17281", "11917",
"14549", "5116", "10553", "23108", "19521", "277", "24312", "5449",
"19006", "9171", "21265", "20494", "11961", "15556", "12237",
"10959", "23460", "14050", "13996", "16222", "21852", "5593",
"18871", "18770", "776", "24913", "7813", "25079", "1063", "22878",
"13638", "19169", "7226", "14895", "8088", "19789", "22835",
"14196", "13816", "7124", "10394", "8290", "16807", "732", "3130",
"16033", "14958", "7500", "15039", "1538", "12532", "2890", "18907",
"21581", "3120", "20198", "22943", "8468", "3128", "24153", "22911",
"6225", "8489", "13040", "17506", "14855", "1500", "11955", "24484",
"17625", "19888", "10351", "19210", "22946", "14699", "1959",
"6770", "23286", "11842", "12811", "22197", "5899", "10138",
"20505", "16090", "17835", "20512", "12271", "9152", "12767",
"25244", "16865", "6970", "10036", "22531", "12329", "15366",
"2", "9440", "2100", "23166", "11421", "18912", "4441", "25202",
"20599", "411", "12584", "1586", "4543", "1307", "10044", "25033",
"5005", "25122", "16236", "9653", "16194", "14393", "7512", "10059",
"12010", "1619", "3136", "24088", "14641", "19564", "9568", "18815",
"21079", "22010", "9553", "20380", "20416", "15745", "7000",
"7735", "24924", "15286", "20403", "4680", "13714", "13302",
"12508", "17514", "4480", "7446", "3723", "24069", "25317", "14607",
"12274", "21715", "8983", "23488", "9228", "7265", "18192", "16475",
"11760", "15530", "18177", "11535", "18839", "17908", "9789",
"18045", "1025", "21645", "11853", "22453", "18052", "22763",
"9", "12286", "15329", "3306", "13215", "16533", "18385", "23784",
"10131", "4894", "14154", "3365", "8648", "17325", "21219", "16689",
"9969", "10621", "24206", "19621", "8440", "19889"), class = "data.frame")
Upvotes: 1
Views: 1839
Reputation: 388862
We can reproduce the problem by changing any one value in the Maritaldummy column to NA.
library(dplyr)
library(ggplot2)
df$Maritaldummy[195] <- NA
df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  mutate(Rs = cumsum(Maritaldummy %in% c("Not married", "Married")),
         Married_Rs = cumsum(Maritaldummy == "Married")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, Married_Rs) %>%
  mutate(Married_prop = Married_Rs/Rs) %>%
  ggplot(aes(x = date, y = Married_prop)) +
  geom_point() +
  geom_line()
This returns:
Warning messages:
1: Removed 38 rows containing missing values (geom_point).
2: Removed 38 row(s) containing missing values (geom_path).
Because at least one value is NA, Maritaldummy == "Married" evaluates to NA at that row, and cumsum returns NA for that row and for every row after it. An easy fix is to use %in% instead of ==, because %in% returns FALSE, not NA, when the value being compared is NA.
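To see the difference in isolation, here is a minimal base R sketch. The vector `x` is a made-up stand-in for Maritaldummy, not taken from the real data.

```r
# A small character vector with one missing value (hypothetical example data).
x <- c("Married", "Not married", NA, "Married")

# `==` propagates the missing value: the third comparison is NA, not FALSE.
eq <- x == "Married"

# `%in%` never returns NA: a missing value simply does not match.
inn <- x %in% "Married"

# cumsum() carries an NA forward to every later position.
cs_eq <- cumsum(eq)
cs_in <- cumsum(inn)

print(cs_eq)  # NA from the third element onward
print(cs_in)  # no NA anywhere
```

This is why a single NA somewhere in the sorted data is enough to blank out every later date in the plot.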
df %>%
  mutate(date = lubridate::mdy(startday)) %>%
  arrange(date) %>%
  # %in% treats NA as "no match" (FALSE), so cumsum never becomes NA
  mutate(Rs = cumsum(Maritaldummy %in% c("Not married", "Married")),
         Married_Rs = cumsum(Maritaldummy %in% "Married")) %>%
  group_by(date) %>%
  slice(n()) %>%
  select(date, Rs, Married_Rs) %>%
  mutate(Married_prop = Married_Rs/Rs) %>%
  ggplot(aes(x = date, y = Married_prop)) +
  geom_point() +
  geom_line()
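If you want to confirm this diagnosis on the full sample, one possible check is to count the missing values and find the earliest date on which one occurs. With the == version, the points should disappear from that date onward. The sketch below uses a small hypothetical data frame `toy` in place of the real df, and base R's as.Date instead of lubridate so it has no dependencies.

```r
# Hypothetical stand-in for the real data, with one NA in Maritaldummy.
toy <- data.frame(
  startday = c("05/20/2019", "05/21/2019", "06/05/2019", "06/06/2019"),
  Maritaldummy = c("Married", "Not married", NA, "Married"),
  stringsAsFactors = FALSE
)

# Parse month/day/year strings into Date objects.
toy$date <- as.Date(toy$startday, format = "%m/%d/%Y")

# How many values are missing, and on which date does the first one occur?
is_missing <- is.na(toy$Maritaldummy)
n_missing <- sum(is_missing)
first_missing <- if (n_missing > 0) min(toy$date[is_missing]) else NA

print(n_missing)
print(first_missing)
```

If n_missing is zero for groups 2 and 3 but positive for group 1, that would explain why only group 1 is affected, and why small random samples of it sometimes work.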
Upvotes: 1