dplyr update original data frame with new values after subset

Question

I am trying to change the value of entries in a dataframe column based on a series of conditions. I need to change the 'group' values of the top (or bottom) 10 entries of a certain type.

My data is in a dataframe that looks like this:

> head(diff_df_min)
  external_gene_name   gene_biotype  Fold  p.value     group
1      RP11-431K24.1        lincRNA -4.13 4.86e-06 signif_fc
2              UBE4B protein_coding  2.42 3.91e-06 signif_fc
3             UBIAD1 protein_coding  2.74 5.58e-05 signif_fc
4             PTCHD2 protein_coding  3.37 2.68e-06 signif_fc
5             DRAXIN protein_coding  3.04 1.42e-06 signif_fc
6             VPS13D protein_coding  4.26 1.60e-07 signif_fc

> dim(diff_df_min)
[1] 1824    5

I have figured out this solution with dplyr:

diff_df_min %>%
        filter(gene_biotype == "protein_coding") %>% # subset for protein coding genes
        arrange(-Fold, p.value) %>% # Sort by Fold change, then by p value
        slice(1:10) %>% # take the top 10 entries... 
        mutate(group = "top_signif_fc") # ... and change the "group" column value to "top_signif_fc"

This gives the exact results I want:

   external_gene_name   gene_biotype Fold  p.value         group
1               CROCC protein_coding 5.46 3.44e-14 top_signif_fc
2               KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
3             PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
4                RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
5             HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
6              SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
7               DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
8            ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
9              RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

Except this is not updating these values in the original dataframe, it is only showing the results after applying the functions. Similarly, I have tried to do the same in data.table and figured out this method:

setDT(diff_df_min,key = "external_gene_name")
diff_df_min[gene_biotype == "protein_coding"][order(-Fold, p.value), head(.SD, 10)][,group := "top_signif_fc"]

But again this only RETURNS the results, it does not update the original dataframe.

    external_gene_name   gene_biotype Fold  p.value         group
 1:              CROCC protein_coding 5.46 3.44e-14 top_signif_fc
 2:              KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
 3:            PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
 4:               RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
 5:            HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
 6:             SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
 7:              DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
 8:           ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
 9:             RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10:              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

You can see this when you check the values in the data frame after running any of these commands (or run a subset of the commands again):

> diff_df_min[which(diff_df_min['external_gene_name'] == "CROCC"),]
    external_gene_name   gene_biotype Fold  p.value     group
372              CROCC protein_coding 5.46 3.44e-14 signif_fc

And of course, if you try to use either of the methods like this:

diff_df_min <- ...

You end up overwriting the original dataframe with only the 10 lines that were selected with dplyr or data.table.

I had previously been doing similar things in base R, but could not get this case to work. I tried it, and ended up with this, which is ridiculous and does not work correctly:

diff_df_min[with(diff_df_min[which(diff_df_min['gene_biotype'] == "protein_coding"),], order(-Fold, p.value) ),"group"][1:top_gene_number] <- "top_signif_fc"

^^ Somewhere along the way, the indexes get messed up and so the entries that are eventually changed are not the intended entries.

I have read dozens and dozens of pages about this so far, including many tutorials and even this but so far I have been unable to find anything that actually gives a solution for this. I don't want to simply print out a modified dataframe, I want to update the original dataframe entries with the new entries.

akrun · Accepted Answer

We can use an ifelse statement to make the changes instead of slice to subset it and also replace the filter (that removes the rows) with arrangeing based on the "protein_coding" as well, and also assign the the output back to the original dataset or to a new

diff_df_minNew <- diff_df_min %>%
                     arrange(desc(gene_biotype == "protein_coding"), 
                                desc(Fold), p.value) %>% 
                     mutate(group = ifelse(row_number() < 11, "top_signif_fc", group))

A corresponding option using data.table would be

library(data.table)
diff_df_minNew2 <- setDT(diff_df_min)[order(-(gene_biotype=="protein_coding"),
      -Fold, p.value)][seq_len(10), group := "top_signif_fc"][]

dplyr update original data frame with new values after subset

Answers (1)

Related Questions