Reputation: 7
I have a dataframe of 286 columns and 157355 rows. I wish to subset rows that contain one or more of several defined factor variables such as F32, F341 etc. Once this has been completed, I wish to identify which other factor variables are most common in the subset rows.
I have tried to filter for values of interest but an error messages appears saying the data must be numeric, logical or complex, for example;
d<- a %>%
filter_at(vars(f.41202.0.0:f.41202.0.65), all_vars('F32'))
I also tried this, but the resulting dataframe had no values present;
f <- a %>%
rowwise() %>%
filter(any(c(1:280) %in% c('F32', 'F320', 'F321', 'F322', 'F323',
'F328', 'F329', 'F330', 'F331', 'F332',
'F333', 'F334', 'F338', 'F339')))
the same occurred when I tried to place all relevant variables into an ICD object;
f <- b %>%
rowwise() %>%
filter(any(c(1:286) %in% ICD))
I would greatly appreciate any suggestions, thanks
my data looks like this (sorry I can't find a way to format it better on this page);
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
2 NA NA M87 NA
3 NA F3 NA K17
4 NA NA F3 M87
After sub-setting rows based on F3 it should look like this;
Row.name Var1 Var2 Var3 Var4
1 F3 NA NA M87
3 NA F3 NA K17
4 NA NA F3 M87
so the same variable columns are retained, but rows without F3 are removed
then I would hope to list the other variables (other than F3) based on how common they are within that subset, in this case that would be
most common: M87
2nd most common: K17
If it helps, I am trying to identify individuals with a particular disease, then I will try to find out which other diseases those individuals most commonly have
thanks for the help
Upvotes: 0
Views: 444
Reputation: 30474
If you wish to use tidyverse
, you can use filter_all
to look at all of the columns. Then, check if any_vars
are in a vector of diagnostic codes. In my example, I look at F3 and F320.
Afterwards, if you want to count up the number of diagnosis codes, you could reshape your data from wide to long, and then count frequencies. If you wish, you can remove NA
by filter. Let me know if this is what you had in mind.
df <- data.frame(
Var1 = c("F3", NA, NA, NA),
Var2 = c(NA, NA, "F3", NA),
Var3 = c(NA, "M87", NA, "F3"),
Var4 = c("M87", NA, "K17", "M87")
)
library(tidyverse)
df %>%
filter_all(any_vars(. %in% c("F3", "F320"))) %>%
pivot_longer(cols = starts_with("Var"), names_to = "Var", values_to = "Code") %>%
filter(!is.na(Code)) %>%
count(Code, sort = TRUE)
After the filter, you should have:
Var1 Var2 Var3 Var4
1 F3 <NA> <NA> M87
2 <NA> F3 <NA> K17
3 <NA> <NA> F3 M87
After pivot_longer
and count
:
# A tibble: 3 x 2
Code n
<fct> <int>
1 F3 3
2 M87 2
3 K17 1
Side note: if you wish to filter based on only some of your variables (instead of selecting all variables), you can use filter_at
instead, such as:
filter_at(vars(starts_with("Var")), any_vars(. %in% c("F3", "F320")))
Upvotes: 0