How to add rows to dataframe by combining values of already existing rows with R dplyr

Question

I have the following table, which I would like to modify:

type	position	ratio	number	percentage
DNA	intergenic	0.00026933362	225173	40.757876065
DNA	intragenic	0.00021799943	41250	7.466536342
LINE	intergenic	0.00027633335	48619	8.800376494
LINE	intragenic	0.00031015097	9578	1.733684487

I want to add rows that contain the following modifications:

type: the "type" value is identical between the two rows of a pair (it always is in my case), so repeat it in the "type" column of the new row.
position: change the value from intergenic/intragenic to "genome" if the type condition above holds.
ratio: the weighted mean of the intergenic and intragenic ratios for the same type value, weighted by number:
((number_intragenic * ratio_intragenic) + (number_intergenic * ratio_intergenic))/(number_intragenic + number_intergenic)
number: sum of the number values for the same type: number_intergenic + number_intragenic
percentage: sum of the percentage values for the same type: percentage_intergenic + percentage_intragenic

My problem is that I do not know how to add rows to a dataframe by performing specific calculations on already existing rows. It is easy to add columns using mutate in dplyr. How can I do this for rows?

I would much prefer if the solution is provided in dplyr.

Edit: The formula of the weighted mean was wrong. I had added a + sign instead of a * sign in the following part of the formula: (number_intergenic + ratio_intergenic). It has now been fixed.

Dan Chaltiel · Accepted Answer

Here is a solution.

dplyr is not meant to create new rows as the result is often not considered "tidy". In your case, you want to summarise the data (I used a custom function to this end). You can use the summary on its own or you can use bind_rows() to add it to the original table.

Of note, you might want to check that number_intragenic and its friends each have a length of 1, for instance with stopifnot(length(number_intragenic) == 1). This may prevent some unexpected errors.
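The length check could be sketched as follows. This is only an illustration in base R: the name get_ratio_checked is hypothetical, and the formula is the weighted mean given in the question.

```r
# Hypothetical guarded variant of the helper; the name get_ratio_checked
# is illustrative only.
get_ratio_checked = function(number, ratio, position){
  number_intragenic = number[position == "intragenic"]
  ratio_intragenic  = ratio[position == "intragenic"]
  number_intergenic = number[position == "intergenic"]
  ratio_intergenic  = ratio[position == "intergenic"]
  # Fail early unless the group has exactly one row per position
  stopifnot(length(number_intragenic) == 1,
            length(number_intergenic) == 1)
  # Weighted mean of the two ratios, weighted by number
  (number_intragenic * ratio_intragenic + number_intergenic * ratio_intergenic) /
    (number_intragenic + number_intergenic)
}

# The two DNA rows from the question
get_ratio_checked(number   = c(225173, 41250),
                  ratio    = c(0.00026933362, 0.00021799943),
                  position = c("intergenic", "intragenic"))
```

With a duplicated or missing position inside a group, the call stops with an error instead of silently recycling values.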

library(tidyverse)
df = read.table(header=T, text="
type    position    ratio   number  percentage
DNA intergenic  0.00026933362   225173  40.757876065
DNA intragenic  0.00021799943   41250   7.466536342
LINE    intergenic  0.00027633335   48619   8.800376494
LINE    intragenic  0.00031015097   9578    1.733684487
")

# Weighted mean of the intragenic and intergenic ratios, weighted by number
get_ratio = function(number, ratio, position){
  number_intragenic = number[position=="intragenic"]
  ratio_intragenic = ratio[position=="intragenic"]
  number_intergenic = number[position=="intergenic"]
  ratio_intergenic = ratio[position=="intergenic"]
  ((number_intragenic * ratio_intragenic) + (number_intergenic * ratio_intergenic))/(number_intragenic+number_intergenic)
}

# One summary row per type, labelled "genome"
df_summary = df %>% 
  group_by(type) %>% 
  summarise(ratio=get_ratio(number, ratio, position), 
            number=sum(number), 
            percentage=sum(percentage)) %>% 
  mutate(position="genome", .after="type")

# Append the summary rows to the original table
bind_rows(df, df_summary) %>% 
  arrange(type)
#>   type   position        ratio number percentage
#> 1  DNA intergenic 0.0002693336 225173  40.757876
#> 2  DNA intragenic 0.0002179994  41250   7.466536
#> 3  DNA     genome 0.0002613856 266423  48.224412
#> 4 LINE intergenic 0.0002763333  48619   8.800376
#> 5 LINE intragenic 0.0003101510   9578   1.733684
#> 6 LINE     genome 0.0002818990  58197  10.534061

Created on 2021-05-12 by the reprex package (v2.0.0)
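As a possible alternative to a custom function, base R's weighted.mean() computes the same quantity when number is used as the weight. The sketch below assumes dplyr is installed; the name df_genome is illustrative, and the data is the table from the question.

```r
library(dplyr)

df = read.table(header = TRUE, text = "
type position ratio number percentage
DNA intergenic 0.00026933362 225173 40.757876065
DNA intragenic 0.00021799943 41250 7.466536342
LINE intergenic 0.00027633335 48619 8.800376494
LINE intragenic 0.00031015097 9578 1.733684487
")

# df_genome is an illustrative name for the per-type summary rows
df_genome = df %>%
  group_by(type) %>%
  # ratio is computed before number is overwritten by its sum,
  # so the weights are still the per-row counts
  summarise(position   = "genome",
            ratio      = weighted.mean(ratio, w = number),
            number     = sum(number),
            percentage = sum(percentage))

# Append the summary rows and keep each type together
bind_rows(df, df_genome) %>%
  arrange(type)
```

The order of the arguments inside summarise() matters here: if number = sum(number) came first, weighted.mean() would receive the summed value as its weight.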
