Chetan Arvind Patil

Reputation: 866

Trim Data Based on First Character of Column Name

I have a data set with multiple columns. Using R, I want to keep only the columns whose names start with the character T, creating a subset like the output data shown below.

Input Data

T1234 T5678 T9101112 A B D E
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7
  1     2       3    4 5 6 7

Output Data

T1234 T5678 T9101112
  1     2       3   
  1     2       3   
  1     2       3   
  1     2       3   
  1     2       3   
  1     2       3   
  1     2       3   

Any suggestions on how this can be achieved? Thanks.

Upvotes: 0

Views: 63

Answers (3)

Mako212

Reputation: 7292

In base R using RegEx

df <- data.frame(T1234=rep(1,7),T5678=2,T9101112=3,A=4,B=5,D=6,E=7)

df[,grepl("^T",names(df))]

The regex pattern ^T matches T at the beginning of each column name. As another example, you could refine the pattern to ^T\\d+ if you wanted to match just "T" followed by one or more digits.

Also note that ^ asserts that you're at the beginning of the string. Without it you'd match "AT912340" because it contains a T.
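To make the effect of the anchor concrete, here is a minimal sketch. The data frame df2 and its column AT912340 are hypothetical, made up only to illustrate the point; drop = FALSE is an addition not in the original answer, used so the result stays a data frame even when a single column matches.

```r
# Hypothetical data: "AT912340" contains a T but does not start with one
df2 <- data.frame(T1234 = 1:3, AT912340 = 4:6, B = 7:9)

# Anchored: TRUE only for names that begin with "T"
anchored <- grepl("^T", names(df2))

# Unanchored: TRUE for any name that contains "T" anywhere
unanchored <- grepl("T", names(df2))

# drop = FALSE keeps a data frame even if only one column is selected
res_anchored   <- df2[, anchored, drop = FALSE]
res_unanchored <- df2[, unanchored, drop = FALSE]

names(res_anchored)    # only T1234
names(res_unanchored)  # T1234 and AT912340
```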

For multiple prefixes (i.e. columns that start with T or M) we'd use the "or" operator |, grouping the alternatives in parentheses so that ^ applies to each of them:

df[,grepl("^(T|M)",names(df))]

And to match groups of characters like RDY or MTP we'd do it like this:

df[,grepl("^(T|MTP|Check|RDY)",names(df))]

Note: in the comments I mistakenly used brackets like so: [T,M]. Brackets tell RegEx to match any one of the characters inside them, so [T,M] matches "T", "M", or ",". We don't want to match a comma here; characters inside brackets are simply listed, not separated by commas. To match "T" or "M" the correct syntax with brackets would be [TM]. However, to match words or short strings like those above, we must use | as the "or" operator.
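The difference between these patterns can be checked on a small character vector. This is only a sketch: the names in cols are hypothetical and chosen to expose each case. It also shows why the alternatives should be grouped: in an ungrouped pattern such as ^T|M, the anchor binds only to T, so M is matched anywhere in the name.

```r
# Hypothetical column names chosen to expose each pattern's behaviour
cols <- c("T1234", "MTP_1", "A,B", "ATM", "RDY2", "Check1")

# Bracket class with a stray comma: matches T, M or "," anywhere
with_comma <- cols[grepl("[T,M]", cols)]

# Anchored bracket class: names starting with T or M
class_tm <- cols[grepl("^[TM]", cols)]

# Ungrouped alternation: "starts with T" OR "contains M anywhere"
ungrouped <- cols[grepl("^T|M", cols)]

# Grouped alternation: the anchor applies to every alternative
grouped <- cols[grepl("^(T|MTP|Check|RDY)", cols)]

with_comma  # "T1234" "MTP_1" "A,B" "ATM"
class_tm    # "T1234" "MTP_1"
ungrouped   # "T1234" "MTP_1" "ATM"
grouped     # "T1234" "MTP_1" "RDY2" "Check1"
```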

Upvotes: 2

agstudy

Reputation: 121568

Another solution, without using regex:

df[,substr(names(df),1,1) %in% c("T","M")]
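The substr() approach compares a single character, so it does not extend directly to prefixes of different lengths. A possible regex-free variant, sketched here as an assumption rather than as part of the original answer, uses base R's startsWith() (available since R 3.3.0). The prefixes vector and the drop = FALSE argument are illustrative additions.

```r
df <- data.frame(T1234 = rep(1, 7), T5678 = 2, T9101112 = 3,
                 A = 4, B = 5, D = 6, E = 7)

# startsWith() compares literal prefixes; no regex is involved
keep <- startsWith(names(df), "T")
subset_t <- df[, keep, drop = FALSE]

# Several prefixes of different lengths (hypothetical list):
# build one logical vector per prefix and combine them with OR
prefixes <- c("T", "MTP")
keep_any <- Reduce(`|`, lapply(prefixes, function(p) startsWith(names(df), p)))
subset_any <- df[, keep_any, drop = FALSE]

names(subset_t)
```

Because the prefixes are treated literally, characters that are special in a regex (such as "." or "[") need no escaping here.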

Upvotes: 0

user1533380

Reputation:

> require(dplyr)
> df <- data.frame(T1234=rep(1,7),T5678=2,T9101112=3,A=4,B=5,D=6,E=7)
> df
  T1234 T5678 T9101112 A B D E
1     1     2        3 4 5 6 7
2     1     2        3 4 5 6 7
3     1     2        3 4 5 6 7
4     1     2        3 4 5 6 7
5     1     2        3 4 5 6 7
6     1     2        3 4 5 6 7
7     1     2        3 4 5 6 7
> select(df,starts_with('T'))
  T1234 T5678 T9101112
1     1     2        3
2     1     2        3
3     1     2        3
4     1     2        3
5     1     2        3
6     1     2        3
7     1     2        3
> 

Or, without dplyr

> df[,grepl('T',colnames(df))]
  T1234 T5678 T9101112
1     1     2        3
2     1     2        3
3     1     2        3
4     1     2        3
5     1     2        3
6     1     2        3
7     1     2        3
> 

but the latter will match a T in any position, not just the first; use the pattern '^T' to anchor the match to the start of the column name.

Upvotes: 1
