Grouping Data then filling a column in R based on the grouped data

Question

For any combination of columns A,C,D,E,F, and G, I am trying to find the column B value when it is within 5% of the value in column D. Once found I would like to paste that value in a new column wherever the combination is.

Here is a sample of the data I am working with:

structure(list(A = c(500L, 10000L, 5000L, 500L, 100L, 500L, 1000L, 
10000L, 5000L, 1000L, 500L, 5000L, 100L, 5000L, 500L, 500L, 500L, 
1000L, 10000L, 500L), B = c(1.53147891704226, 5.51999984066968, 
1.69897000433602, 3.49996186559619, 2.8668778143375, 2.27415784926368, 
2.69983772586725, 4.30000820255381, 4.28000895310819, 1.14612803567824, 
3.40001963506516, 4.88000138832177, 2.3747483460101, 4, 3.03342375548695, 
3.04999285692014, 2.59988307207369, 3.51666755909904, 4.40000234592796, 
2.82477646247555), C = c(0.118917162666339, 32.46875, 0.00120927734375, 
6.69645182291667e-06, 38.1009114583333, 0.03888505859375, 0.984812890625, 
181.953125, 0.0079256796875, 0.0397203010315885, 1.693359375, 
0.25630859375, 0.00419210611979167, 1.4658203125, 0.00764973958333333, 
0.294973113716194, 8.8974609375, 0.0014642802734375, 67.609375, 
0.00205580344395639), D = c(4.63125661725864, 34.1632795742744, 
0.262987871586425, 9.53427792464916e-06, 38.7106620745277, 0.187395038620314, 
0.99014163328848, 211.108639904501, 0.0108561099088211, 9.82604248822947, 
1.95692192890506, 0.262987871586425, 0.00616933538501461, 2.23297962243741, 
0.020686261349356, 0.53228350287947, 26.4570757028734, 0.00221508528097736, 
68.1735822402243, 0.00495578134094092), E = c(2, 2, 2, 100, 2, 
100, 2, 2, 100, 2, 2, 2, 2, 100, 100, 2, 2, 100, 100, 2), F = c(1e-05, 
1e-06, 1e-07, 1e-08, 1e-05, 1e-06, 1e-04, 1e-05, 1e-06, 1e-05, 
1e-06, 1e-07, 1e-07, 1e-07, 1e-08, 1e-06, 1e-06, 1e-06, 1e-05, 
1e-08), G = c("Effective Number of Haplotypes", "Number of Polymorphic Sites", 
"Gene Diversity", "Nucleotide Diversity", "Number of Heterozygotes", 
"Gene Diversity", "Gene Diversity", "Number of Polymorphic Sites", 
"Nucleotide Diversity", "Effective Number of Haplotypes", "Number of Haplotypes", 
"Gene Diversity", "Gene Diversity", "Number of Haplotypes", "Number of Polymorphic Sites", 
"Effective Number of Haplotypes", "Number of Heterozygotes", 
"Nucleotide Diversity", "Number of Heterozygotes", "Effective Number of Haplotypes"
)), .Names = c("A", "B", "C", "D", "E", "F", "G"), row.names = c("11025", 
"13649", "37612", "178511", "9864", "15883", "2469", "7104", 
"15089", "11140", "18719", "47812", "36151", "31315", "66810", 
"17609", "16501", "14975", "10860", "45318"), class = "data.frame")

The working line of code I have is:

min(df[which(df$C>=(0.05*df$D) & df$G == 'Nucleotide Diversity' & df$F==1e-6 & df$A==5000 & df$E==100),]$B) This returns the one number I want for the combination of columns A,C,D,E,F, and G.

Issue/Question 1: I am stuck on pasting this number in a new column, H where all the combinations of A,C,D,E,F, and G, can be found.

Issue/Question 2: Is there an automated way to do this without plugging in values for df$G== , df$F== , df$A==, and df$E==?

Ideal Output

A       B         C           D          E   F                   G                        H
500 1.531479 1.189172e-01 4.631257e+00   2   1e-05     Effective Number of Haplotypes
10000 5.520000 3.246875e+01 3.416328e+01   2   1e-06    Number of Polymorphic Sites
5000 1.698970 1.209277e-03 2.629879e-01   2   1e-07                 Gene Diversity
5000 3.499962 6.696452e-06 9.534278e-06  100   1e-06           Nucleotide Diversity         4.280009
100 2.866878 3.810091e+01 3.871066e+01   2   1e-05        Number of Heterozygotes
500 2.274158 3.888506e-02 1.873950e-01  100   1e-06                 Gene Diversity
1000 2.699838 9.848129e-01 9.901416e-01   2   1e-04                 Gene Diversity
10000 4.300008 1.819531e+02 2.111086e+02   2   1e-05    Number of Polymorphic Sites
5000 4.280009 7.925680e-03 1.085611e-02  100   1e-06           Nucleotide Diversity        4.280009
...

I would assume there is a way to do this but the term I should search for isn't clear to me.

Ronak Shah · Accepted Answer

Do you mean something like this ?

library(dplyr)
df %>% group_by(A, G) %>% mutate(H = min(B[C >= 0.05 * D]))

This returns the minimum value of B where C value is greater than equal to 5% of D. Maybe you want to add more variables in group_by.

Grouping Data then filling a column in R based on the grouped data

Answers (1)

Related Questions