Reputation: 65
I have the following table, which i would like to modify:
type | position | ratio | number | percentage |
---|---|---|---|---|
DNA | intergenic | 0.00026933362 | 225173 | 40.757876065 |
DNA | intragenic | 0.00021799943 | 41250 | 7.466536342 |
LINE | intergenic | 0.00027633335 | 48619 | 8.800376494 |
LINE | intragenic | 0.00031015097 | 9578 | 1.733684487 |
I want to add rows that contain the following modifications:
type: if the value "type" is identical between two rows (it always is in my case), add it again in a separate row of the column "type".
position: change the value from intergenic/intragenic to "genome" if (1)
ratio: ratio value would be the weighted mean calculated from the ratio of intergenic and intragenic rows of the same type value:
((number_intragenic * ratio_intragenic) + (number_intergenic * ratio_intergenic))/(number_intragenic + number_intergenic)
number: sum of number values for the same type: sum(number_intergenic + number_intragenic)
sum of the percentage values for the same type: sum(percentage_intergenic + percentage_intragenic)
My problem is that I do not know how to add rows to dataframe by making specific calculation from already existing rows. It is easy to add columns using mutate in dplyr. How can I do this for rows?
I would much prefer if the solution is provided in dplyr.
Edit: The formula of the weighted mean was wrong. I had added a + sign instead of a * sign in the following part of the formula: (number_intergenic + ratio_intergenic). It has now been fixed.
Upvotes: 2
Views: 82
Reputation: 21938
Dedicated to dear @akrun who taught me how to do this:
library(dplyr)
library(purrr)
library(tibble)
df %>%
group_split(type) %>%
map_dfr(~ add_row(.x, type = first(.x$type), position = "genome",
ratio = ((.x$ratio[2]*.x$number[2]) + (.x$ratio[1]+.x$number[1])) /
(.x$number[1]+.x$number[2]), number = .x$number[1] + .x$number[2],
percentage = .x$percentage[1] + .x$percentage[2]))
# A tibble: 6 x 5
type position ratio number percentage
<chr> <chr> <dbl> <int> <dbl>
1 DNA intergenic 0.000269 225173 40.8
2 DNA intragenic 0.000218 41250 7.47
3 DNA genome 0.845 266423 48.2
4 LINE intergenic 0.000276 48619 8.80
5 LINE intragenic 0.000310 9578 1.73
6 LINE genome 0.835 58197 10.5
Upvotes: 2
Reputation: 8523
Here is a solution.
dplyr
is not meant to create new rows as the result is often not considered "tidy". In your case, you want to summarise the data (I used a custom function to this end). You can use the summary on its own or you can use bind_rows()
to add it to the original table.
Of note, you might want to check that number_intragenic
and its friends has a length of 1, it may prevent some unexpected errors, for instance with stopifnot(length(number_intragenic)==1
.
library(tidyverse)
df = read.table(header=T, text="
type position ratio number percentage
DNA intergenic 0.00026933362 225173 40.757876065
DNA intragenic 0.00021799943 41250 7.466536342
LINE intergenic 0.00027633335 48619 8.800376494
LINE intragenic 0.00031015097 9578 1.733684487
")
get_ratio = function(number, ratio, position){
number_intragenic = number[position=="intragenic"]
ratio_intragenic = ratio[position=="intragenic"]
number_intergenic = number[position=="intergenic"]
ratio_intergenic = ratio[position=="intergenic"]
((number_intragenic * ratio_intragenic) + (number_intergenic + ratio_intergenic))/(number_intragenic+number_intergenic)
}
df_summary = df %>%
group_by(type) %>%
summarise(ratio=get_ratio(number, ratio, position),
number=sum(number),
percentage=sum(percentage)) %>%
mutate(position="genome", .after="type")
bind_rows(df, df_summary) %>%
arrange(type)
#> type position ratio number percentage
#> 1 DNA intergenic 0.0002693336 225173 40.757876
#> 2 DNA intragenic 0.0002179994 41250 7.466536
#> 3 DNA genome 0.8452047787 266423 48.224412
#> 4 LINE intergenic 0.0002763333 48619 8.800376
#> 5 LINE intragenic 0.0003101510 9578 1.733684
#> 6 LINE genome 0.8354721189 58197 10.534061
Created on 2021-05-12 by the reprex package (v2.0.0)
Upvotes: 1