cmirian

Reputation: 2253

R: how to simultaneously change multiple column names based on the individual suffix of each column name

I have received a datasheet, p, autogenerated from a registry and containing 1855 columns. The autogeneration appends _vX to each column name, where X corresponds to the follow-up number. Unfortunately, the suffixes accumulate, which creates ridiculously long column names.

For example,

p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10 and p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20

correspond to the 10th and 20th MRI scans of the same patient. That is, every column holding a clinical parameter from the 10th follow-up ends with _v1_v2_v3_v4_v5_v6_v7_v8_v9_v10.

I am looking for a solution, preferably in dplyr or as a function, that replaces the entire _v1_v2_... suffix with fuX, where X is the follow-up number.

Let's say that p looks like:

  a_v2 b_v2_v3 a_v2_v3_v4 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1    0       1          1                                                                        1                                                                        0
2    1       1          0                                                                        1                                                                        0

Expected output:

> p
  a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
1     0     1     1      1      0
2     1     1     0      1      0

Data

p <- structure(list(
  dia_maxrd_v2 = c(0, 1),
  hear_sev_v2_v3 = c(1, 1),
  reop_ind_v2_v3_v4___1 = c(1, 0),
  neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1, 1),
  symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0, 0)
), class = "data.frame", row.names = c(NA, -2L))

EDIT

To complicate things, some column names end with "___1", which indicates a specific parameter relating to that clinical parameter and must be preserved, e.g. _v1_v2_v3_v4___1. Such a column still counts as fu4, and the ___1 part must not be dropped.

  a_v2 b_v2_v3 a_v2_v3_v4___1 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1    0       1              1                                                                        1                                                                        0
2    1       1              0                                                                        1                                                                        0                                                                      

Expected output:

> p
  a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
1     0     1         1      1      0
2     1     1         0      1      0

EDIT

My apologies, the solution must also keep the "basic" column name, which specifies what parameter the column contains, e.g. post-surgical complications. Only the _v1_v2_v3..._vX part should be replaced with the corresponding fuX. Whatever comes before and after the _v1_v2_v3..._vX part must be preserved.

Consider

  dia_maxrd_v2 hear_sev_v2_v3 reop_ind_v2_v3_v4___1 neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1            0              1                     1                                                                                1                                                                                     0
2            1              1                     0                                                                                1                                                                                     0

Expected output:

> p
  dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20 symp_pre_lokal_fu20
1             0            1                1              1                   0
2             1            1                0              1                   0

Upvotes: 1

Views: 72

Answers (1)

Allan Cameron

Reputation: 173793

You can use gsub with two capturing groups:

names(p) <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", names(p))

p
#>   a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
#> 1     0     1     1      1      0
#> 2     1     1     0      1      0
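To see what the two capture groups keep, here is a minimal sketch on bare name vectors (`short_names` and `long_names` are made up for illustration, not taken from the registry export). The first group holds only the first character, so the pattern suits single-letter prefixes like those in the example but would truncate longer base names.

```r
# Illustrative name vectors (hypothetical)
short_names <- c("a_v2", "b_v2_v3", "a_v2_v3_v4")
long_names  <- c("dia_maxrd_v2", "hear_sev_v2_v3")

# Group 1: the first character; group 2: the digits at the very end
pattern <- "^(.).*?(\\d+)$"

renamed_short <- gsub(pattern, "\\1_fu\\2", short_names)
renamed_long  <- gsub(pattern, "\\1_fu\\2", long_names)

print(renamed_short)  # "a_fu2" "b_fu3" "a_fu4"
print(renamed_long)   # "d_fu2" "h_fu3": everything after the first letter is lost
```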

EDIT

To meet the OP's new requirements (use inside a pipe, and handle endings such as ___1 that were not in the original question):

library(dplyr)  # provides the %>% pipe

p %>% setNames(gsub("^(.).*?(\\d+_*\\d*)$", "\\1_fu\\2", names(.)))
#>   a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
#> 1     0     1         1      1      0
#> 2     1     1         0      1      0

EDIT

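For arbitrary starting strings, it may be easiest to call gsub twice: the first call deletes every number that is followed by another _v, leaving only the final _vX, and the second rewrites that _vX as _fuX: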

p %>% setNames(gsub("(\\d{1,2}_v)+", "", names(.))) %>%
      setNames(gsub("_v(\\d+)", "_fu\\1", names(.)))

#>   dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20
#> 1             0            1                1              1
#> 2             1            1                0              1
#>   symp_pre_lokal_fu20
#> 1                   0
#> 2                   0
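The two substitutions can probably also be folded into one. The following is a sketch rather than a drop-in, assuming every follow-up suffix is an unbroken run of `_v<number>` parts and that no base name itself contains `_v` followed by a digit. It uses base R only; `new_names` is an illustrative variable.

```r
# Same data as in the question
p <- structure(list(
  dia_maxrd_v2 = c(0, 1),
  hear_sev_v2_v3 = c(1, 1),
  reop_ind_v2_v3_v4___1 = c(1, 0),
  neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1, 1),
  symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0, 0)
), class = "data.frame", row.names = c(NA, -2L))

# Collapse a run of "_v<number>" parts into "_fu<last number>".
# (?:_v\\d+)* consumes the leading parts; (\\d+) captures the final number.
# Text before and after the run (e.g. "___1") is left untouched.
new_names <- sub("(?:_v\\d+)*_v(\\d+)", "_fu\\1", names(p), perl = TRUE)
names(p) <- new_names

print(names(p))
```

Inside a dplyr pipe, the same `sub()` call could be passed to `rename_with()` (available in dplyr 1.0.0 and later) in place of `setNames()`.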

Upvotes: 2
