plast1cd0nk3y

Reputation: 418

How do I use the across function across all columns of a dataframe rather than specify certain columns?

I have the following script so far that successfully creates a new column in my dataframe and populates it with a sum of how many times the value "TRUE" appears for each row of the dataframe:

data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))

You will notice after I bring in the across function, I specify that I want to drop a column from the function. However, I actually don't want to drop any columns from my function. I tried writing something like

across(data_1 %in%, TRUE) to indicate I want to go across the whole dataframe/all columns, but this is not the correct syntax.

Also, I tried to do this a much simpler way using just rowSums and no mutate as follows:

data_1$True_Count <- rowSums(df == TRUE) but all this did was create an empty column called True_Count; it did not count the occurrences of TRUE logical values in each row. I also tried the same thing with a string value that I know occurs exactly once in my dataset: data_1$True_Count <- rowSums(df == "banana"). This did the same thing: it created an empty column and did not count the instance of banana in my dataset.

Lastly there was one more behavior that I did not understand. If I run the first code, data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE))) more than once, the counts in the True_Count column cease to be correct.

Upvotes: 0

Views: 241

Answers (1)

Ronak Shah

Reputation: 388972

It is really helpful if you share data in a reproducible format together with the expected output, so that everyone understands the question the same way.

Since you did not share an example, I created one myself to explain the answer here. I have added 4 random columns with TRUE/FALSE values since it seems this is what your dataset contains.

data_1 <- data.frame(Community.name = c(TRUE, FALSE, TRUE, FALSE, FALSE),
                     Community.code = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                     col1 = TRUE,
                     col2 = c(FALSE, FALSE, TRUE, FALSE, FALSE))
data_1

#  Community.name Community.code col1  col2
#1           TRUE           TRUE TRUE FALSE
#2          FALSE          FALSE TRUE FALSE
#3           TRUE          FALSE TRUE  TRUE
#4          FALSE           TRUE TRUE FALSE
#5          FALSE           TRUE TRUE FALSE

Note that TRUE (logical) is different from "TRUE" (character). So first verify if your dataset contains logical values or character values before trying out the answers below.
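
One quick way to check this is to look at the class of every column. The sketch below uses a small made-up data frame (check_df and its column names are hypothetical, not taken from your data) with one logical and one character column:

```r
# Hypothetical example data: one logical column and one character column
check_df <- data.frame(flag_lgl = c(TRUE, FALSE),
                       flag_chr = c("TRUE", "FALSE"),
                       stringsAsFactors = FALSE)

# Report the class of each column
col_classes <- sapply(check_df, class)
col_classes
```

Here flag_lgl is reported as "logical" and flag_chr as "character". Running the same sapply() call on your own data shows which case you are in.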


This is your current code, which drops Community.name and counts the TRUE values in each row across the remaining columns.

library(dplyr)

data_2 <- data_1 %>% 
      mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))
data_2

#  Community.name Community.code col1  col2 True_Count
#1           TRUE           TRUE TRUE FALSE          2
#2          FALSE          FALSE TRUE FALSE          1
#3           TRUE          FALSE TRUE  TRUE          2
#4          FALSE           TRUE TRUE FALSE          2
#5          FALSE           TRUE TRUE FALSE          2

This seems to work as expected: Community.name is ignored and the TRUE values in the remaining columns are counted for each row.


Now your question,

I actually don't want to drop any columns from my function.

For that you can use everything() in across to include all the columns.

data_3 <- data_1 %>% 
           mutate(True_Count = rowSums(across(everything(), `%in%`, TRUE)))

data_3

#  Community.name Community.code col1  col2 True_Count
#1           TRUE           TRUE TRUE FALSE          3
#2          FALSE          FALSE TRUE FALSE          1
#3           TRUE          FALSE TRUE  TRUE          3
#4          FALSE           TRUE TRUE FALSE          2
#5          FALSE           TRUE TRUE FALSE          2

Also note that everything() is the default column selection in across() (see ?across), at least in the dplyr version used here.
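
As a side note, dplyr 1.1.0 and later deprecate passing extra arguments such as TRUE through the ... of across(). A minimal sketch of the same count written with an anonymous function instead, on the same example data (data_3b is just a name chosen for this illustration):

```r
library(dplyr)

# Same example data as above
data_1 <- data.frame(Community.name = c(TRUE, FALSE, TRUE, FALSE, FALSE),
                     Community.code = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                     col1 = TRUE,
                     col2 = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# The lambda ~ .x %in% TRUE replaces the `%in%`, TRUE argument pair
data_3b <- data_1 %>%
  mutate(True_Count = rowSums(across(everything(), ~ .x %in% TRUE)))

data_3b$True_Count
# [1] 3 1 3 2 2
```

The counts match data_3 above; only the way the comparison is passed to across() changes.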


Also, I tried to do this a much simpler way using just rowSums and no mutate

Yes, using rowSums without mutate is a much simpler way that gives the same answer. It works directly here because every column in the example is logical.

data_1$True_Count <- rowSums(data_1)
data_1

#  Community.name Community.code col1  col2 True_Count
#1           TRUE           TRUE TRUE FALSE          3
#2          FALSE          FALSE TRUE FALSE          1
#3           TRUE          FALSE TRUE  TRUE          3
#4          FALSE           TRUE TRUE FALSE          2
#5          FALSE           TRUE TRUE FALSE          2
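
If your real data also contains non-logical columns (for example a character Community.name), rowSums() on the whole data frame fails because it needs numeric or logical input. One possible workaround, sketched on a made-up data frame (mixed and its columns are hypothetical), is to restrict the sum to the logical columns:

```r
# Hypothetical data with one character column and two logical columns
mixed <- data.frame(name = c("a", "b"),
                    x = c(TRUE, FALSE),
                    y = c(TRUE, TRUE),
                    stringsAsFactors = FALSE)

# TRUE for every column that is of type logical
is_lgl <- sapply(mixed, is.logical)

# Sum only the logical columns; na.rm = TRUE skips missing values
mixed$True_Count <- rowSums(mixed[is_lgl], na.rm = TRUE)
mixed$True_Count
# [1] 2 1
```

Because is_lgl is computed before True_Count is added, the new column is not part of the sum.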

Lastly there was one more behavior that I did not understand. If I run the first code, more than once, the counts in the True_Count column cease to be correct.

That is probably because True_Count does not exist in data_1 the first time you run the code. The first run adds True_Count to data_1, so when you run the code a second time, True_Count itself is included in the calculation, which is not what you want.
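
One way to make the code safe to rerun is to exclude True_Count explicitly with any_of(), which does not complain when the column is missing. This is only a sketch on made-up data; demo, count_true and count_true_naive are hypothetical names introduced for the illustration:

```r
library(dplyr)

# Hypothetical two-row example
demo <- data.frame(a = c(TRUE, FALSE), b = c(TRUE, TRUE))

# Naive version: on a rerun, the existing True_Count column is scanned too.
# Since 1 %in% TRUE is TRUE, a row whose previous count was 1 gains an extra hit.
count_true_naive <- function(df) {
  df %>% mutate(True_Count = rowSums(across(everything(), ~ .x %in% TRUE)))
}

# Safer version: always leave True_Count out of the selection
count_true <- function(df) {
  df %>%
    mutate(True_Count = rowSums(across(-any_of("True_Count"), ~ .x %in% TRUE)))
}

once  <- count_true(demo)
twice <- count_true(once)

# The safer version gives the same counts however often it is run
identical(once$True_Count, twice$True_Count)
```

Alternatively, always start from the original data without True_Count (as with data_2 and data_3 above, which are built from data_1 instead of overwriting it).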

Upvotes: 1
