Fabrizio

Reputation: 939

Tidy(verse) way to do a series of operations

I keep writing a piece of code in a certain way, but lately I have started to wonder whether I could make it better and more readable. The tidyverse came to mind.

Let me explain:

set.seed(123)
a_df=data.frame(sample=sample(c("A","B", "C"), 50, replace=TRUE),
                type= paste0(sample(letters[1:3], 50, replace=TRUE), sample(letters[1:3],50, replace=TRUE)),
                area=sample(1:100, 50, replace=TRUE) )

I have a dataframe similar to the one in the example. It has a column called "sample", one called "type" and another "area".

 head(a_df)
  sample type area
1      C   ac    9
2      C   ab   71
3      C   cb   98
4      B   ac   48
5      C   ba   77
6      B   aa   83

I need to compute some values with the third column for each sample but only for some specific types.

I define my types in "targets" and use a double for loop, iterating over the targets and over the unique values of "sample".

NOTE: I use grep to select the rows in the temporary data frame "tdf". In my real code this step is more complex: instead of "grep" I have a function that takes 2 parameters, the target (a_targ) and the "type" column of the "tdf" data frame. This step, with a function of 2 parameters, must be preserved. I will then adapt the answer to my own function.

The idea of the computation is to count how many entries of a given target type there are in a "sample" and to divide this count by the sum of "area" over those entries.

targets=c("ab", "bb")
all_densities=NULL
for(a_targ in targets){
  for(i in unique(a_df$sample)){
    tdf=a_df[a_df$sample==i, ]
    tdf=tdf[grep(a_targ, tdf$type),]
    a_dens=nrow(tdf)/sum(tdf$area)
    df_res=data.frame(sample=i, type=a_targ, density=a_dens)
    all_densities=rbind(all_densities, df_res)
  }
}


> head(all_densities)
  sample type    density
1      C   ab 0.01435407
2      B   ab 0.02500000
3      A   ab 0.01117318
4      C   bb 0.02068966
5      B   bb        NaN
6      A   bb 0.03658537

For instance, for sample "A" we can interrogate the data frame as follows:

a_df[a_df$sample=="A" & a_df$type=="bb",]
   sample type area
10      A   bb    1
21      A   bb   67
44      A   bb   14

The "density" is 3 (the number of rows) divided by "1+67+14", and the result corresponds to the 0.03658537 reported in "all_densities".
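As a quick check of that arithmetic, the three "area" values shown for sample "A" and type "bb" can be plugged in directly (the variable names here are only illustrative):

```r
# Areas of the three "bb" rows listed above for sample "A"
areas <- c(1, 67, 14)

# density = number of matching rows / sum of their areas
density_A_bb <- length(areas) / sum(areas)
print(density_A_bb)  # 3 / 82
```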

Would anybody be able to rewrite it with pipes, group_by, and in general in a more tidy way?

Upvotes: 1

Views: 63

Answers (1)

harre

Reputation: 7297

You can do it with group_by and summarise (the filtering and arranging after ungroup are only there to make the output comparable to yours):

library(dplyr)

a_df |> 
  group_by(sample, type) |>
  summarise(a_dens = n()/sum(area)) |>
  ungroup() |>
  filter(type %in% c("ab", "bb")) |>
  arrange(type, desc(sample))

Output:

# A tibble: 5 × 3
  sample type  a_dens
  <chr>  <chr>  <dbl>
1 C      ab    0.0144
2 B      ab    0.025 
3 A      ab    0.0112
4 C      bb    0.0207
5 A      bb    0.0366

Updated after the OP's update / change of formula.
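If the two-parameter matching function from the question has to be kept, one possible sketch is to build a logical "hit" column for each target and then summarise per sample. Here match_target is a hypothetical stand-in for the OP's real function (it is assumed to return one TRUE/FALSE per element of the "type" column), and the data frame is rebuilt from the question so that the snippet runs on its own. Unlike the pipeline above, this keeps sample/target combinations with no matching rows, which come out as NaN (0/0), as in the original loop.

```r
library(dplyr)

# Data from the question
set.seed(123)
a_df <- data.frame(
  sample = sample(c("A", "B", "C"), 50, replace = TRUE),
  type   = paste0(sample(letters[1:3], 50, replace = TRUE),
                  sample(letters[1:3], 50, replace = TRUE)),
  area   = sample(1:100, 50, replace = TRUE)
)

# Hypothetical stand-in for the OP's two-parameter function (an assumption):
# takes the target and the "type" column, returns a logical vector of matches.
match_target <- function(target, types) grepl(target, types, fixed = TRUE)

targets <- c("ab", "bb")

all_densities <- lapply(targets, function(a_targ) {
  a_df |>
    # flag the rows that match the current target
    mutate(hit = match_target(a_targ, type)) |>
    group_by(sample) |>
    # matching rows / sum of their areas; 0/0 gives NaN when nothing matches
    summarise(density = sum(hit) / sum(area[hit]), .groups = "drop") |>
    mutate(type = a_targ, .before = density)
}) |>
  bind_rows()

all_densities
```

Swapping in the real function only requires changing the body of match_target, as long as it still returns a logical vector of the same length as the "type" column.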

Upvotes: 2
