Reputation: 841
I need to subset a df to include certain strings. Some of these are full column names, and the following works fine:
testData[,c("FullColName1","FullColName2","FullColName3")]
My problem is that I need to expand this to also include column names that contain specific strings that may partially match to some other column names. These strings include letters and symbols:
"PartString1()","PartString2()"
I tried putting wildcards around these. (I've indicated this below with the prefix "star" because the "*" symbol didn't render correctly.)
testData[ ,c("FullColName1","FullColName2","FullColName3",
"starPartString1()star","starPartString2()star")]
But I'm getting an error message: "undefined columns selected"
. I can't figure out if or how I need grep()
to make this work.
Upvotes: 22
Views: 52273
Reputation: 2210
The question asked is how to retrieve specific column names, given only partial strings. Let me offer a simple grepl
solution.
#example data
df <- data.frame(col1_sse = paste0(1:5, LETTERS[1:5]),
col2_swl = runif(5, max = 10),
col3_sdz = runif(5, max = 1000),
col4_swl = paste0(letters[1:5]))
#assume partial names are complex
partial_names <- c("2_sw", "sdz")
#create a "keepers" list of column names
keepers <- names(df)[grepl(paste0(partial_names, collapse = "|"), names(df))]
#use "keepers" to extract cols from original data
new_df <- df[,keepers]
Upvotes: 0
Reputation: 99331
You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]]
as our regular expression. This will find all the strings with punctuation symbols in the column names.
d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
"PartString1()","PartString2()")
d[grepl("[[:punct:]]", names(d))]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1
This last part just illustrates another way to do this with other string processing functions from stringr
library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
# PartString1() PartString2()
# 1 13 1
# 2 12 1
# 3 11 1
ADD per OPs comment
d[grepl("ring[12()]", names(d))]
to get either of the substrings ring1()
or ring2()
from the names vector
Upvotes: 19
Reputation: 3597
You can use grep
to find indices of column names with partial match to a particular pattern
require(PerformanceAnalytics)
data(managers)
colnames(managers)
#[1] "HAM1" "HAM2" "HAM3" "HAM4" "HAM5"
#[6] "HAM6" "EDHEC LS EQ" "SP500 TR" "US 10Y TR" "US 3m TR"
suppose the pattern you want to match is "HAM", along with some fixed column names ("SP500 TR" "US 10Y TR" "US 3m TR")
head(managers[,c("SP500 TR","US 10Y TR","US 3m TR",colnames(managers)[grep("HAM",colnames(managers))])])
# SP500 TR US 10Y TR US 3m TR HAM1 HAM2 HAM3 HAM4 HAM5 HAM6
#1996-01-31 0.0340 0.00380 0.00456 0.0074 NA 0.0349 0.0222 NA NA
#1996-02-29 0.0093 -0.03532 0.00398 0.0193 NA 0.0351 0.0195 NA NA
#1996-03-31 0.0096 -0.01057 0.00371 0.0155 NA 0.0258 -0.0098 NA NA
#1996-04-30 0.0147 -0.01739 0.00428 -0.0091 NA 0.0449 0.0236 NA NA
#1996-05-31 0.0258 -0.00543 0.00443 0.0076 NA 0.0353 0.0028 NA NA
#1996-06-30 0.0038 0.01507 0.00412 -0.0039 NA -0.0303 -0.0019 NA NA
you can specify multiple patterns using, grep("pattern1 | pattern2 ", colnames(data))
Upvotes: 10
Reputation: 42639
You can use grepl
for a search by column name. It returns a logical vector indicating matches.
Here is an example:
d <- read.table(header=TRUE, check.names=FALSE,
text="1PartString()2 1PartString()3 OtherCol
1 2 3
3 4 5")
d
## 1PartString()2 1PartString()3 OtherCol
## 1 1 2 3
## 2 3 4 5
d[,grepl("PartString\\(\\)", names(d))]
## 1PartString()2 1PartString()3
## 1 1 2
## 2 3 4
grepl
check to see if the pattern is present anywhere in the name, so a wildcard is not required.
Upvotes: 4