Reputation: 155
Although there are more columns and observations, my dataframe looks like the following:
dt <- data.frame(hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
"Full-time", "Full-time", "Full-time", "Unemployed", "Part-time",
"Full-time"),
relhead = c("Head", "Head", "Head", "Partner", "other", "Head",
"Partner", "Head", "Partner", "Head", "Partner"))
| hid | syear | employlvl | relhead |
|-----|-------|-------------|-----------------------|
| 1 | 2000 | Full-time | Head |
| 2 | 2001 | Part-time | Head |
| 2 | 2003 | Part-time | Head |
| 2 | 2003 | Unemployed | Partner |
| 2 | 2003 | Unemployed | other |
| 4 | 2000 | Full-time | Head |
| 4 | 2000 | Full-time | Partner |
| 4 | 2001 | Full-time | Head |
| 4 | 2001 | Unemployed | Partner |
| 4 | 2002 | Part-time | Head |
| 4 | 2002 | Full-time | Partner |
I would like to create another column which indicates the employmentlevel of the Partner and hope to get the following output:
| hid | syear | employlvl | relhead | Partner |
|-----|-------|-------------|-----------------------|-------------------|
| 1 | 2000 | Part-time | Head | NA |
| 2 | 2001 | Part-time | Head | NA |
| 2 | 2003 | Part-time | Head | Unemployed |
| 2 | 2003 | Unemployed | Partner | NA |
| 2 | 2003 | Unemployed | other | NA |
| 4 | 2000 | Full-time | Head | Full-time |
| 4 | 2000 | Full-time | Partner | NA |
| 4 | 2001 | Full-time | Head | Unemployed |
| 4 | 2001 | Unemployed | Partner | NA |
| 4 | 2002 | Part-time | Head | Full-time |
| 4 | 2002 | Full-time | Partner | NA |
Currently, I am using the following code:
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(hid, syear) %>%
filter(n() > 1) %>%
filter(`relhead` != "Child") %>%
spread(relhead, employlvl) %>%
mutate(Relation = "Head") %>%
rename(`Employment Partner` = Partner) %>%
select(-Head)
dt3 <- dt %>%
left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))
The code works absolutely fine for this small data set. But as soon as I try for my whole data, I get the following:
Error: Data source must be a dictionary
Upvotes: 15
Views: 10046
Reputation: 50704
As stated in other answers this is caused by non unique names. I was able to reproduce error by modifying your example (third element of relhead
)
dt <- data.frame(
hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
"Full-time", "Full-time", "Full-time", "Unemployed", "Part-time",
"Full-time"),
relhead = c("Head", "Head", "Employment Partner", "Partner", "other", "Head",
"Partner", "Head", "Partner", "Head", "Partner")
)
In that case spread
creates first "Employment Partner"
column and rename
creates second. You should check if any of "Employment Partner"
, "Relation"
(and maybe hid
, syear
) is in dt$relhead
(first one gives you error, second one is overwrite by mutate(Relation=...)
).
Minimal reproducible example:
data_frame(g = c("a1","a2","a3"), i=1) %>%
spread(g, i) %>%
rename(a1 = a3) %>%
select(-a1)
Important update: now we have more informative error message:
Error in `rename()`:
! Names must be unique.
✖ These names are duplicated:
* "Employment Partner" at locations 3 and 6.
Run `rlang::last_trace()` to see where the error occurred.
Upvotes: 6
Reputation: 131
After carefully checking my data set, I found that there are two columns having the same name. After I renamed one of them, then it works with no errors.
Upvotes: 13
Reputation: 193
The issue (I believe) is the difference in behavior between identically named functions in plyr
, and dplyr
. So when you have them both loaded, you can get unexpected results. I see this with group_by
and summarize
as well.
Generally, the best way I've found to deal with this is to just use dplyr::select
, dplyr::rename
, et cetera.
What'd be even better would just be to not use plyr
, because dplyr
has it covered at this point, but I have some legacy code that uses plyr
, so I'm reticent to mess around with it.
Upvotes: 0
Reputation: 146
I got the same error message when I inattentively used 2 the same new names in rename()
statement of dplyr
package. Compare names(df2)
with unique(names(df2))
as you might have already had the same variable name before.
Upvotes: 3
Reputation: 468
It's being caused by doing the select(-variable)
after the rename
call. I got the same error and when I removed the "rename" call and did the same select(-variable) it worked.
Have no clue why this is the case, but that's the trigger for the error.
Upvotes: 1
Reputation: 39154
If the error only occurs after you ran select(-Head)
, you can probably find a workaround by using base R command to achieve the same thing.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(hid, syear) %>%
filter(n() > 1) %>%
filter(`relhead` != "Child") %>%
spread(relhead, employlvl) %>%
mutate(Relation = "Head") %>%
rename(`Employment Partner` = Partner)
The above part is the same as the original code. After that, run the following.
dt2$Head <- NULL
This is a base R command to remove the Head
column, which is the same thing select(-Head)
wants to do.
And then you can run the rest of the code to join the data frames.
dt3 <- dt %>%
left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))
Since you did not provide a reproducible example, we could not figure out what this error message really means, but perhaps this workaround can help you achieve your task for now.
Upvotes: 1