Emily Robbs
Emily Robbs

Reputation: 1

Needing to move values of one column based on values of another matching column names in a separate data frame

I have two data frames, split and nsD. Inside of nsD, I have two columns: one containing numeric data (nsD$counts) and other containing an 8-character ID tag (nsD$ID).

counts ID
12 hb123400
31 hb123400
22 hb123400
48 aa339800
13 aa339800
9 aa339800

Inside of split, each column name corresponds to the values in nsD$ID.

hb123400 aa399800
0 0
0 0

What I need to do is move the values of nsD$counts for which nsD$ID matches the individual column names in split, and I need to be able to do this for all 551 columns/IDs easily. I'm honestly unsure of how to get there; I know for-loop containing which() is likely involved, but I'm unsure of how to proceed. (If it helps, I also have a chr list IDlist containing the unique IDs.)

Desired outcome:

hb123400 aa399800
12 48
31 13
22 9

Thanks for any help!

ETA: I have successfully moved a small portion of this data using the following code:

nsD$counts[which(nsD$ID=="2982b018")] -> split$2982b018

This moves one small (1881 values, out of 1mil+) portion of data to the right column, but as you can see, I'm having to individually designate which ID tag to look for.

Upvotes: 0

Views: 226

Answers (1)

akrun
akrun

Reputation: 887881

We don't need to create an empty datasets of 0s and fill it. Suppose, we want only some 'ID' elements from 'nsD', filter those (here it is based on the column names of 'split', but it can be just a vector of names), then we use pivot_wider from tidyr to reshape from 'long' to 'wide'. As there are duplicate elements, make sure that we create a sequence id to make it unique (rowid from data.table - or use group_by(ID) %>% mutate(rn = row_number()) in dplyr)

library(dplyr)
library(tidyr)
library(data.table)
out <- nsD %>%
    filter(ID %in% names(split)) %>%
    mutate(rn = rowid(ID)) %>% 
    pivot_wider(names_from = ID, values_from = counts) %>%
    select(-rn)

-output

out
# A tibble: 3 x 2
  hb123400 aa339800
     <int>    <int>
1       12       48
2       31       13
3       22        9

Note that when we are not loading packages correctly, there can be side-effects i.e. filter is one function which is present in base R stats. So, if we didn't load the dplyr with library(dplyr), this can be result in filter being used from the stats package

 nsD %>%
+     stats::filter(ID %in% names(split)) %>%
+     mutate(rn = rowid(ID)) %>% 
+     pivot_wider(names_from = ID, values_from = counts) %>%
+     select(-rn)
Error in ID %in% names(split) : object 'ID' not found

Generally, with tidyverse packages, there are some function names that are also present in other packages i.e. summarise/mutate can get masked with plyr::summarise/plyr::mutate. A better practise is to either use :: or create a unique name by assignment

nsD %>%
     dplyr::filter(ID %in% names(split)) %>%
     dplyr::mutate(rn = rowid(ID)) %>% 
     tidyr::pivot_wider(names_from = ID, values_from = counts) %>%
     dplyr::select(-rn)

Or may use

filter_d <- dplyr::filter
mutate_d <- dplyr::mutate
select_d <- dplyr::select
nsD %>%
     filter_d(ID %in% names(split)) %>%
     mutate_d(rn = rowid(ID)) %>% 
     tidyr::pivot_wider(names_from = ID, values_from = counts) %>%
     select_d(-rn)

data

nsD <- structure(list(counts = c(12L, 31L, 22L, 48L, 13L, 9L), ID = c("hb123400", 
"hb123400", "hb123400", "aa339800", "aa339800", "aa339800")), class = "data.frame", row.names = c(NA, 
-6L))

split <- structure(list(hb123400 = c(0L, 0L), aa339800 = c(0L, 0L)), class = "data.frame", row.names = c(NA, 
-2L))

NOTE: split is a function name. It is better not to name objects with already existing functions.

Upvotes: 2

Related Questions