RacktheMan
RacktheMan

Reputation: 65

How to add rows to dataframe by combining values of already existing rows with R dplyr

I have the following table, which i would like to modify:

type position ratio number percentage
DNA intergenic 0.00026933362 225173 40.757876065
DNA intragenic 0.00021799943 41250 7.466536342
LINE intergenic 0.00027633335 48619 8.800376494
LINE intragenic 0.00031015097 9578 1.733684487

I want to add rows that contain the following modifications:

  1. type: if the value "type" is identical between two rows (it always is in my case), add it again in a separate row of the column "type".

  2. position: change the value from intergenic/intragenic to "genome" if (1)

  3. ratio: ratio value would be the weighted mean calculated from the ratio of intergenic and intragenic rows of the same type value:
    ((number_intragenic * ratio_intragenic) + (number_intergenic * ratio_intergenic))/(number_intragenic + number_intergenic)

  4. number: sum of number values for the same type: sum(number_intergenic + number_intragenic)

  5. sum of the percentage values for the same type: sum(percentage_intergenic + percentage_intragenic)

My problem is that I do not know how to add rows to dataframe by making specific calculation from already existing rows. It is easy to add columns using mutate in dplyr. How can I do this for rows?

I would much prefer if the solution is provided in dplyr.

Edit: The formula of the weighted mean was wrong. I had added a + sign instead of a * sign in the following part of the formula: (number_intergenic + ratio_intergenic). It has now been fixed.

Upvotes: 2

Views: 82

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21938

Dedicated to dear @akrun who taught me how to do this:

library(dplyr)
library(purrr)
library(tibble)

df %>%
  group_split(type) %>%
  map_dfr(~ add_row(.x, type = first(.x$type), position = "genome", 
                    ratio = ((.x$ratio[2]*.x$number[2]) + (.x$ratio[1]+.x$number[1])) / 
                      (.x$number[1]+.x$number[2]), number = .x$number[1] + .x$number[2], 
                    percentage = .x$percentage[1] + .x$percentage[2]))


# A tibble: 6 x 5
  type  position      ratio number percentage
  <chr> <chr>         <dbl>  <int>      <dbl>
1 DNA   intergenic 0.000269 225173      40.8 
2 DNA   intragenic 0.000218  41250       7.47
3 DNA   genome     0.845    266423      48.2 
4 LINE  intergenic 0.000276  48619       8.80
5 LINE  intragenic 0.000310   9578       1.73
6 LINE  genome     0.835     58197      10.5 

Upvotes: 2

Dan Chaltiel
Dan Chaltiel

Reputation: 8523

Here is a solution.

dplyr is not meant to create new rows as the result is often not considered "tidy". In your case, you want to summarise the data (I used a custom function to this end). You can use the summary on its own or you can use bind_rows() to add it to the original table.

Of note, you might want to check that number_intragenic and its friends has a length of 1, it may prevent some unexpected errors, for instance with stopifnot(length(number_intragenic)==1.

library(tidyverse)
df = read.table(header=T, text="
type    position    ratio   number  percentage
DNA intergenic  0.00026933362   225173  40.757876065
DNA intragenic  0.00021799943   41250   7.466536342
LINE    intergenic  0.00027633335   48619   8.800376494
LINE    intragenic  0.00031015097   9578    1.733684487
")

get_ratio = function(number, ratio, position){
  number_intragenic = number[position=="intragenic"]
  ratio_intragenic = ratio[position=="intragenic"]
  number_intergenic = number[position=="intergenic"]
  ratio_intergenic = ratio[position=="intergenic"]
  ((number_intragenic * ratio_intragenic) + (number_intergenic + ratio_intergenic))/(number_intragenic+number_intergenic)
}

df_summary = df %>% 
  group_by(type) %>% 
  summarise(ratio=get_ratio(number, ratio, position), 
            number=sum(number), 
            percentage=sum(percentage)) %>% 
  mutate(position="genome", .after="type")

bind_rows(df, df_summary) %>% 
  arrange(type)
#>   type   position        ratio number percentage
#> 1  DNA intergenic 0.0002693336 225173  40.757876
#> 2  DNA intragenic 0.0002179994  41250   7.466536
#> 3  DNA     genome 0.8452047787 266423  48.224412
#> 4 LINE intergenic 0.0002763333  48619   8.800376
#> 5 LINE intragenic 0.0003101510   9578   1.733684
#> 6 LINE     genome 0.8354721189  58197  10.534061

Created on 2021-05-12 by the reprex package (v2.0.0)

Upvotes: 1

Related Questions