Reputation: 1
I have two data frames, split
and nsD
. Inside of nsD
, I have two columns: one containing numeric data (nsD$counts
) and other containing an 8-character ID tag (nsD$ID
).
counts | ID |
---|---|
12 | hb123400 |
31 | hb123400 |
22 | hb123400 |
48 | aa339800 |
13 | aa339800 |
9 | aa339800 |
Inside of split
, each column name corresponds to the values in nsD$ID
.
hb123400 | aa399800 |
---|---|
0 | 0 |
0 | 0 |
What I need to do is move the values of nsD$counts
for which nsD$ID
matches the individual column names in split
, and I need to be able to do this for all 551 columns/IDs easily. I'm honestly unsure of how to get there; I know for-loop containing which()
is likely involved, but I'm unsure of how to proceed. (If it helps, I also have a chr list IDlist
containing the unique IDs.)
Desired outcome:
hb123400 | aa399800 |
---|---|
12 | 48 |
31 | 13 |
22 | 9 |
Thanks for any help!
ETA: I have successfully moved a small portion of this data using the following code:
nsD$counts[which(nsD$ID=="2982b018")] -> split$2982b018
This moves one small (1881 values, out of 1mil+) portion of data to the right column, but as you can see, I'm having to individually designate which ID tag to look for.
Upvotes: 0
Views: 226
Reputation: 887881
We don't need to create an empty datasets of 0s and fill it. Suppose, we want only some 'ID' elements from 'nsD', filter
those (here it is based on the column names of 'split', but it can be just a vector
of names), then we use pivot_wider
from tidyr
to reshape from 'long' to 'wide'. As there are duplicate elements, make sure that we create a sequence id to make it unique (rowid
from data.table
- or use group_by(ID) %>% mutate(rn = row_number())
in dplyr)
library(dplyr)
library(tidyr)
library(data.table)
out <- nsD %>%
filter(ID %in% names(split)) %>%
mutate(rn = rowid(ID)) %>%
pivot_wider(names_from = ID, values_from = counts) %>%
select(-rn)
-output
out
# A tibble: 3 x 2
hb123400 aa339800
<int> <int>
1 12 48
2 31 13
3 22 9
Note that when we are not loading packages correctly, there can be side-effects i.e. filter
is one function which is present in base R
stats
. So, if we didn't load the dplyr
with library(dplyr)
, this can be result in filter
being used from the stats
package
nsD %>%
+ stats::filter(ID %in% names(split)) %>%
+ mutate(rn = rowid(ID)) %>%
+ pivot_wider(names_from = ID, values_from = counts) %>%
+ select(-rn)
Error in ID %in% names(split) : object 'ID' not found
Generally, with tidyverse
packages, there are some function names that are also present in other packages i.e. summarise/mutate
can get masked with plyr::summarise/plyr::mutate
. A better practise is to either use ::
or create a unique name by assignment
nsD %>%
dplyr::filter(ID %in% names(split)) %>%
dplyr::mutate(rn = rowid(ID)) %>%
tidyr::pivot_wider(names_from = ID, values_from = counts) %>%
dplyr::select(-rn)
Or may use
filter_d <- dplyr::filter
mutate_d <- dplyr::mutate
select_d <- dplyr::select
nsD %>%
filter_d(ID %in% names(split)) %>%
mutate_d(rn = rowid(ID)) %>%
tidyr::pivot_wider(names_from = ID, values_from = counts) %>%
select_d(-rn)
nsD <- structure(list(counts = c(12L, 31L, 22L, 48L, 13L, 9L), ID = c("hb123400",
"hb123400", "hb123400", "aa339800", "aa339800", "aa339800")), class = "data.frame", row.names = c(NA,
-6L))
split <- structure(list(hb123400 = c(0L, 0L), aa339800 = c(0L, 0L)), class = "data.frame", row.names = c(NA,
-2L))
NOTE: split
is a function name. It is better not to name objects with already existing functions.
Upvotes: 2