Reputation: 93
I was given a spectroscopy reference database that is very different from the sample dataset I'm using. In my data, 0 or 1 indicate if there's a peak or not, while in the reference database the peak position is listed as a row value and grouped as a column in terms of peptide data (which I don't need).
My dataset looks like this:
Sample 1110 1111 1112
1 1 0 0
2 1 0 1
3 0 1 1
4 1 1 1
While the reference database obnoxiously looks like this (note the multiple values per column):
Species peptide1 peptide2 peptide3
cow 1110 1112 NA
sheep 1111 1112 NA
goat NA 1113 1114
The desired output would look similar to my dataset:
Species 1110 1111 1112 1113 1114
cow 1 0 1 0 0
sheep 0 1 1 0 0
goat 0 0 0 1 1
These are oversimplified but they illustrate the problem - let me know if I need to provide actual data. I need to both transpose/sort the values in the cells, while replacing the original values in a binary (again, I don't need to keep the peptide column names in the reference database). I'm really hoping there's an easy dplyr or tidyr trick for this - I imagine a spread function could work, but I wouldn't know how to do it for multiple columns nor how to leave the original data. Alternatively, I could manually append all of the data as a long format, then melt/cast it into a wider format?
Upvotes: 2
Views: 150
Reputation: 21908
I hope this is what you are looking for:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(!Species) %>%
mutate(val = 1) %>%
select(-name) %>%
drop_na() %>%
arrange(value) %>%
pivot_wider(names_from = value, values_from = val) %>%
mutate(across(!Species, ~ replace_na(., 0)))
# A tibble: 3 x 6
Species `1110` `1111` `1112` `1113` `1114`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cow 1 0 1 0 0
2 sheep 0 1 1 0 0
3 goat 0 0 0 1 1
Data
df <- tribble(
~Species, ~peptide1, ~peptide2, ~peptide3,
"cow", 1110, 1112, NA,
"sheep", 1111, 1112, NA,
"goat", NA, 1113, 1114
)
Simplifying the above syntax by using the really useful arguments of pivot_longer()
& pivot_wider()
as
df %>%
pivot_longer(!Species, values_drop_na = TRUE) %>%
pivot_wider(id_cols = Species, names_from = value, names_sort = TRUE, values_fill = 0, values_fn = length)
# A tibble: 3 x 6
Species `1110` `1111` `1112` `1113` `1114`
<chr> <int> <int> <int> <int> <int>
1 cow 1 0 1 0 0
2 sheep 0 1 1 0 0
3 goat 0 0 0 1 1
Upvotes: 6
Reputation: 388817
In data.table
you can use melt
and dcast
:
library(data.table)
dcast(melt(setDT(df), 1, na.rm = TRUE), Species~value, fun.aggregate = length)
# Species 1110 1111 1112 1113 1114
#1: cow 1 0 1 0 0
#2: goat 0 0 0 1 1
#3: sheep 0 1 1 0 0
Upvotes: 2