rnorouzian

Reputation: 7517

subsetting from same-named data.frame in R

I have a data.frame called c41 (read from the CSV linked in the code below). Some column names (e.g., type) are repeated once or twice in the file. As a result, read.csv() appends a ".number" suffix (e.g., type.1) to distinguish between them.

Suppose I want to keep the rows where any column with a "type" root in its name equals 3. Currently, I drop the ".number" suffixes and then subset, but that incorrectly returns nothing.

Question: In BASE R, how can I subset a variable value (type == 3) without needing to include the ".number" suffixes (e.g., type == 3 instead of type.1 == 3)?

In other words, how can I find any "type" column whose value is 3, regardless of its numeric suffix?

c41 <- read.csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")

c42 <- setNames(c41, sub("\\.\\d+$", "", names(c41))) # Take off the `".number"` suffixes

subset(c42, type == 3) # Now subset, but it returns nothing!
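To make the problem reproducible without the remote file, here is a minimal sketch using a hypothetical inline CSV (the values are made up and are not the real c4.csv). It shows where the suffix comes from and that stripping it leaves duplicated names. As far as I can tell, a bare `type` inside subset() is then resolved to only one of the same-named columns, which would explain the empty result.

```r
# Hypothetical inline CSV with a repeated "type" header (not the real data)
csv_text <- "type,x,type\n1,a,3\n2,b,1"

# With the default check.names = TRUE, read.csv() makes the names unique
d <- read.csv(text = csv_text)
names(d)

# Stripping the ".number" suffix brings the duplicated names back
d2 <- setNames(d, sub("\\.\\d+$", "", names(d)))
names(d2)

# TRUE when at least one column name is repeated
has_dupes <- anyDuplicated(names(d2)) > 0
has_dupes
```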

Upvotes: 0

Views: 118

Answers (2)

lroha

Reputation: 34406

Renaming the columns to make them non-unique is a recipe for a headache and is not advisable. Without renaming the columns, in base R you could do something like this instead:

c41[rowSums(c41[grep("^type", names(c41))] == 3, na.rm = TRUE) > 0,]
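The one-liner above can be unpacked step by step. This is only a sketch on a hypothetical toy data frame `d` (made-up values standing in for c41). The stricter pattern `^type(\\.\\d+)?$` is my own variation, so that a column such as "typeX" would not be picked up by accident.

```r
# Hypothetical toy data standing in for c41 (not the real CSV)
d <- data.frame(type   = c(1, 2, 3, 1),
                x      = c("a", "b", "c", "d"),
                type.1 = c(3, NA, 1, 2))

# Positions of columns named exactly "type" or "type.<number>"
type_cols <- grep("^type(\\.\\d+)?$", names(d))

# Logical matrix: TRUE where a type column equals 3, NA where it is missing
hits <- d[type_cols] == 3

# Keep rows with at least one TRUE; na.rm = TRUE counts NA as "no match"
keep <- rowSums(hits, na.rm = TRUE) > 0

d[keep, ]
```

Here rows 1 and 3 are kept: row 1 has type.1 equal to 3, and row 3 has type equal to 3.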

I don't think subset() can be used here if column names are duplicated.

Upvotes: 2

neilfws

Reputation: 33782

EDIT: I see that you edited your question to specify base R. Can't help you there! But perhaps a dplyr solution is of interest.

You can use dplyr::filter_at and the starts_with helper.

library(dplyr)
library(readr)

c4 <- read_csv("https://raw.githubusercontent.com/izeh/l/master/c4.csv")
c4 %>% 
  filter_at(vars(starts_with("type")), any_vars(. == 3))

Adding a select_at to display just the relevant columns:

c4 %>% 
  filter_at(vars(starts_with("type")), any_vars(. == 3)) %>% 
  select_at(vars(starts_with("type")))

Result:

# A tibble: 2 x 2
   type type_1
  <dbl>  <dbl>
1     1      3
2     2      3
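As a side note, filter_at() and any_vars() are marked as superseded in dplyr 1.0.0 and later. A rough equivalent with if_any() (available from dplyr 1.0.4) might look like the sketch below. It uses a hypothetical toy tibble `d` with made-up values rather than the real CSV.

```r
library(dplyr)

# Hypothetical toy data standing in for c4 (not the real CSV)
d <- tibble(type   = c(1, 2, 3, 1),
            x      = c("a", "b", "c", "d"),
            type_1 = c(3, 1, 1, 2))

# Keep rows where any column whose name starts with "type" equals 3
res <- d %>%
  filter(if_any(starts_with("type"), ~ .x == 3))

res
```

The selection helper is the same as in the filter_at() version, so the behaviour should match for this kind of "any column equals a value" filter.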

Upvotes: 1
