Reputation: 2226
I have a data frame like the one below, generated with a dplyr summary function.
pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
.
.
A bar plot of this data with ggplot2 gives 'uneven' bars because pos 231 has no A and G rows for the corresponding sample name. Those rows are absent from the data, which is generated by a program outside of R.
What would be an idiomatic way of inserting a 0 total for each missing value of A, T, G, C at each position for each corresponding sample? In other words, how do I get this data frame?
pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
231 G 10028_1#2 0
231 A 10028_1#2 0
Upvotes: 2
Views: 112
Reputation: 887951
We can use complete() from tidyr:
library(dplyr)
library(tidyr)
df1 %>%
  complete(pos, nuc, nesting(sample), fill = list(total = 0))
# pos nuc sample total
# <int> <chr> <chr> <dbl>
#1 23 A 10028_1#2 3
#2 23 C 10028_1#2 1
#3 23 G 10028_1#2 5129
#4 23 T 10028_1#2 128
#5 231 A 10028_1#2 0
#6 231 C 10028_1#2 4
#7 231 G 10028_1#2 0
#8 231 T 10028_1#2 3123
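One caveat: complete() only expands over values that actually occur in a column, so a nucleotide that is absent from every row would still be missing afterwards. A minimal sketch, assuming A, C, G and T are the full set of interest and supplying them explicitly (the name filled is only illustrative):

```r
library(tidyr)

# Same example data as in this answer, rebuilt here so the block runs on its own
df1 <- data.frame(
  pos    = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc    = c("A", "C", "G", "T", "C", "T"),
  sample = "10028_1#2",
  total  = c(3L, 1L, 5129L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# Passing nuc = c(...) tells complete() which values to expand over,
# instead of relying on the values that happen to be present in the data.
# fill = list(total = 0L) keeps total as an integer column.
filled <- complete(df1, pos, nuc = c("A", "C", "G", "T"), sample,
                   fill = list(total = 0L))
```

Every pos/sample combination in filled then has one row per nucleotide, with 0 where no count was reported.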
Or we can use expand.grid/merge from base R:
transform(merge(expand.grid(lapply(df1[1:3], unique)),
                df1, all.x = TRUE),
          total = replace(total, is.na(total), 0))
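A note on the base R route: expand.grid() converts character columns to factors by default, so nuc and sample may come back as factors. A sketch of the same idea that keeps them as character and sorts the rows explicitly, assuming the first three columns are the keys (grid and out are illustrative names):

```r
# Same example data as in this answer, rebuilt here so the block runs on its own
df1 <- data.frame(
  pos    = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc    = c("A", "C", "G", "T", "C", "T"),
  sample = "10028_1#2",
  total  = c(3L, 1L, 5129L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# All combinations of the unique key values, kept as character vectors
grid <- expand.grid(lapply(df1[1:3], unique), stringsAsFactors = FALSE)

# Left join onto the grid; combinations without a match get NA in total
out <- merge(grid, df1, all.x = TRUE)

# Replace the NAs with 0 and order by position, then nucleotide
out$total[is.na(out$total)] <- 0L
out <- out[order(out$pos, out$nuc), ]
```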
# Data used in the examples above
df1 <- structure(list(
  pos = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc = c("A", "C", "G", "T", "C", "T"),
  sample = c("10028_1#2", "10028_1#2", "10028_1#2",
             "10028_1#2", "10028_1#2", "10028_1#2"),
  total = c(3L, 1L, 5129L, 128L, 4L, 3123L)),
  .Names = c("pos", "nuc", "sample", "total"),
  class = "data.frame", row.names = c(NA, -6L))
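To tie this back to the plot in the question, here is a hedged sketch of a dodged bar chart on the completed data. The original plotting code was not shown, so the aesthetics used here (position on the x axis, one bar per nucleotide) are assumptions, and filled and p are illustrative names:

```r
library(tidyr)
library(ggplot2)

# Same example data as in this answer, rebuilt here so the block runs on its own
df1 <- data.frame(
  pos    = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc    = c("A", "C", "G", "T", "C", "T"),
  sample = "10028_1#2",
  total  = c(3L, 1L, 5129L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# Fill in the missing pos/nuc/sample combinations with a total of 0
filled <- complete(df1, pos, nuc, sample, fill = list(total = 0L))

# With a row for every nucleotide at every position, each dodge group
# has the same number of bars, so the bar widths come out even.
p <- ggplot(filled, aes(x = factor(pos), y = total, fill = nuc)) +
  geom_col(position = "dodge") +
  labs(x = "pos")

# Call print(p) (or just p at the console) to draw the plot.
```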
Upvotes: 2