Manuel
Manuel

Reputation: 155

Error: Data source must be a dictionary (dplyr)

Although there are more columns and observations, my dataframe looks like the following:

dt <- data.frame(hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
                     syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
                     employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
                                    "Full-time", "Full-time", "Full-time", "Unemployed", "Part-time", 
                                    "Full-time"),
                     relhead = c("Head", "Head", "Head", "Partner", "other", "Head", 
                                                  "Partner", "Head", "Partner", "Head", "Partner")) 

| hid | syear |  employlvl  |       relhead         |
|-----|-------|-------------|-----------------------|
|  1  | 2000  |  Full-time  |         Head          |
|  2  | 2001  |  Part-time  |         Head          |
|  2  | 2003  |  Part-time  |         Head          |
|  2  | 2003  |  Unemployed |        Partner        |
|  2  | 2003  |  Unemployed |         other         |
|  4  | 2000  |  Full-time  |         Head          |
|  4  | 2000  |  Full-time  |        Partner        |
|  4  | 2001  |  Full-time  |         Head          |
|  4  | 2001  |  Unemployed |        Partner        |
|  4  | 2002  |  Part-time  |         Head          |
|  4  | 2002  |  Full-time  |        Partner        |

I would like to create another column which indicates the employmentlevel of the Partner and hope to get the following output:

| hid | syear |  employlvl  |         relhead       |      Partner      |
|-----|-------|-------------|-----------------------|-------------------|
|  1  | 2000  |  Part-time  |         Head          |        NA         |
|  2  | 2001  |  Part-time  |         Head          |        NA         |
|  2  | 2003  |  Part-time  |         Head          |    Unemployed     |
|  2  | 2003  |  Unemployed |       Partner         |        NA         |
|  2  | 2003  |  Unemployed |         other         |        NA         |
|  4  | 2000  |  Full-time  |         Head          |     Full-time     |
|  4  | 2000  |  Full-time  |        Partner        |        NA         |
|  4  | 2001  |  Full-time  |         Head          |    Unemployed     |
|  4  | 2001  |  Unemployed |        Partner        |        NA         |
|  4  | 2002  |  Part-time  |         Head          |     Full-time     |
|  4  | 2002  |  Full-time  |        Partner        |        NA         |

Currently, I am using the following code:

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(hid, syear) %>%
  filter(n() > 1) %>%
  filter(`relhead` != "Child") %>%
  spread(relhead, employlvl) %>%
  mutate(Relation = "Head") %>%
  rename(`Employment Partner` = Partner) %>%
  select(-Head)

dt3 <- dt %>%
  left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))

The code works absolutely fine for this small data set. But as soon as I try for my whole data, I get the following:

Error: Data source must be a dictionary

Upvotes: 15

Views: 10046

Answers (6)

Marek
Marek

Reputation: 50704

As stated in other answers this is caused by non unique names. I was able to reproduce error by modifying your example (third element of relhead)

dt <- data.frame(
  hid = c(1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4),
  syear = c(2000, 2001, 2003, 2003, 2003, 2000, 2000, 2001, 2001, 2002, 2002),
  employlvl = c("Full-time", "Part-time", "Part-time", "Unemployed", "Unemployed",
     "Full-time", "Full-time", "Full-time", "Unemployed", "Part-time", 
     "Full-time"),
  relhead = c("Head", "Head", "Employment Partner", "Partner", "other", "Head", 
     "Partner", "Head", "Partner", "Head", "Partner")
) 

In that case spread creates first "Employment Partner" column and rename creates second. You should check if any of "Employment Partner", "Relation" (and maybe hid, syear) is in dt$relhead (first one gives you error, second one is overwrite by mutate(Relation=...)).

Minimal reproducible example:

data_frame(g = c("a1","a2","a3"), i=1) %>%
    spread(g, i) %>%
    rename(a1 = a3) %>%
    select(-a1)

Important update: now we have more informative error message:

Error in `rename()`:
! Names must be unique.
✖ These names are duplicated:
  * "Employment Partner" at locations 3 and 6.
Run `rlang::last_trace()` to see where the error occurred.

Upvotes: 6

Grace
Grace

Reputation: 131

After carefully checking my data set, I found that there are two columns having the same name. After I renamed one of them, then it works with no errors.

Upvotes: 13

jsg51483
jsg51483

Reputation: 193

The issue (I believe) is the difference in behavior between identically named functions in plyr, and dplyr. So when you have them both loaded, you can get unexpected results. I see this with group_by and summarize as well.

Generally, the best way I've found to deal with this is to just use dplyr::select, dplyr::rename, et cetera.

What'd be even better would just be to not use plyr, because dplyr has it covered at this point, but I have some legacy code that uses plyr, so I'm reticent to mess around with it.

Upvotes: 0

AnnaZ
AnnaZ

Reputation: 146

I got the same error message when I inattentively used 2 the same new names in rename() statement of dplyr package. Compare names(df2) with unique(names(df2)) as you might have already had the same variable name before.

Upvotes: 3

Brent Brewington
Brent Brewington

Reputation: 468

It's being caused by doing the select(-variable) after the rename call. I got the same error and when I removed the "rename" call and did the same select(-variable) it worked.

Have no clue why this is the case, but that's the trigger for the error.

Upvotes: 1

www
www

Reputation: 39154

If the error only occurs after you ran select(-Head), you can probably find a workaround by using base R command to achieve the same thing.

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  group_by(hid, syear) %>%
  filter(n() > 1) %>%
  filter(`relhead` != "Child") %>%
  spread(relhead, employlvl) %>%
  mutate(Relation = "Head") %>%
  rename(`Employment Partner` = Partner)

The above part is the same as the original code. After that, run the following.

dt2$Head <- NULL

This is a base R command to remove the Head column, which is the same thing select(-Head) wants to do.

And then you can run the rest of the code to join the data frames.

dt3 <- dt %>%
  left_join(dt2, by = c("hid", "syear", "relhead" = "Relation"))

Since you did not provide a reproducible example, we could not figure out what this error message really means, but perhaps this workaround can help you achieve your task for now.

Upvotes: 1

Related Questions