Reputation: 621
I have two data frames,
df1
Identifier GSE1028888 GSE1034555
100002 0.1 0.2
100003 0.3 0.4
...... ..... .....
100007 0.9 1.1
df2
V3 V2
100002 XLX12
100003 ABorF
...... .....
110000 GEF22
Now I want insert the V2 information into df1, such as
df3
Identifier New_V2 GSE1028888 GSE1034555
100002 XLX12 0.1 0.2
100003 ABorF 0.3 0.4
100004 NA 0.6 0.7
...... ..... .....
100007 ccL34 0.9 1.1
The V3 of df2 and Identifier of df1 have different length. I try dplyr left_join, but the column is attached at the end.
This is the code to create similar data frame
df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F),
y = sample(rnorm(10), 6, replace = T),
z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010),
V2 = sample(state.name, 10, replace = T))
This time when I try dplyr left_join,
left_join(df1, df2, by =c("Identifier"="V3"))
an error message was shown
Error: cannot join on columns 'V3' x 'Identifier': index out of bounds
Anyone has any idea?
Upvotes: 0
Views: 191
Reputation: 5193
Using the data you provided:
df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F),
y = sample(rnorm(10), 6, replace = T),
z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010),
V2 = sample(state.name, 10, replace = T))
merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)
V1 V2 y z
1 100001 Wisconsin -1.9468637 0.4509951
2 100002 Nebraska 2.5155819 0.4509951
3 100003 Ohio NA NA
4 100004 Massachusetts NA NA
5 100005 Montana -1.9468637 -2.1825878
6 100006 Illinois -0.1591367 0.3445637
7 100007 New Mexico -0.5696300 0.4509951
8 100008 New Jersey NA NA
9 100009 Tennessee NA NA
10 100010 Washington -1.9468637 -0.5402241
And you can toggle the order which ever way you like but the above output is what you requested -no need to reorder but if wanted to:
merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)[c(2,3,4,1)]
V2 y z V1
1 Wisconsin -1.9468637 0.4509951 100001
2 Nebraska 2.5155819 0.4509951 100002
3 Ohio NA NA 100003
4 Massachusetts NA NA 100004
5 Montana -1.9468637 -2.1825878 100005
6 Illinois -0.1591367 0.3445637 100006
7 New Mexico -0.5696300 0.4509951 100007
8 New Jersey NA NA 100008
9 Tennessee NA NA 100009
10 Washington -1.9468637 -0.5402241 100010
As per the OP's posted comments below, here is an example where an identifier exists in df1 that does not exist in df2
df1 <- data.frame("Identifier" = sample(100001:100012, 6, replace = F),
y = sample(rnorm(10), 6, replace = T),
z = sample(rnorm(10), 6, replace = T))
df1
Identifier y z
1 100011 -1.60532712 1.365836073
2 100007 -1.28821500 0.005925986
3 100004 -0.03444609 0.780708952
4 100006 0.32190045 0.780708952
5 100009 -1.60532712 -1.471916384
6 100005 -0.76985033 0.191956916
df2 <- data.frame(V1 = c(100001:100010),
V2 = sample(state.name, 10, replace = T))
df2
V1 V2
1 100001 Pennsylvania
2 100002 West Virginia
3 100003 Utah
4 100004 Alaska
5 100005 Ohio
6 100006 Mississippi
7 100007 New Hampshire
8 100008 New Jersey
9 100009 Ohio
10 100010 Georgia
merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE, all.y=TRUE)
V1 V2 y z
1 100001 Pennsylvania NA NA
2 100002 West Virginia NA NA
3 100003 Utah NA NA
4 100004 Alaska -0.03444609 0.780708952
5 100005 Ohio -0.76985033 0.191956916
6 100006 Mississippi 0.32190045 0.780708952
7 100007 New Hampshire -1.28821500 0.005925986
8 100008 New Jersey NA NA
9 100009 Ohio -1.60532712 -1.471916384
10 100010 Georgia NA NA
11 100011 <NA> -1.60532712 1.365836073
Upvotes: 1
Reputation: 133
One solution is to merge and reorder them (unfortunately not a one-function answer).
Setup
df1 = data.frame(i=c(1,2,3), GSE111=c(4,5,6), GSE222=c(7,8,9))
df2 = data.frame(i=c(1,3,4), v2=c(10,11,12))
This produces:
> df1
i GSE111 GSE222
1 1 4 7
2 2 5 8
3 3 6 9
4 4 7 10
> df2
i v2
1 1 10
2 3 11
3 4 12
Merge:
I've used built-in merge function instead of plyr.
df3 = merge(x=df1, y=df2, by="i", all.x=TRUE)
this will yield:
i GSE111 GSE222 v2
1 1 4 7 10
2 2 5 8 NA
3 3 6 9 11
Reorder:
Now you can reorder by literally typing column names:
df4 = df3[,c("i", "v2", "GSE111", "GSE222")]
which yields
i v2 GSE111 GSE222
1 1 10 4 7
2 2 NA 5 8
3 3 11 6 9
Which is essentially the product you wanted.
Shortcut:
Or, if you are lazy like me, I can generate unique column names between df1 and df2 like this:
unique(c(colnames(df2), colnames(df1)))
This yields:
[1] "i" "v2" "GSE111" "GSE222"
So you can type something like:
df3[,unique(c(colnames(df2), colnames(df1)))]
Upvotes: 4