Reputation: 2253
I have received a data sheet p, autogenerated from a registry and containing 1855 columns. The autogeneration automatically appends _vX to each column name, where X corresponds to the follow-up number. Unfortunately, this creates ridiculously long column names.
E.g.,
p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10
and p$MRI_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
correspond to the 10th and 20th MRI scans of the same patient. That is, each column that addresses clinical parameters related to the 10th follow-up ends with v1_v2_v3_v4_v5_v6_v7_v8_v9_v10.
I seek a solution, preferably in dplyr or as a function, that changes the entire _v1_v2_... suffix to fuX, where X corresponds to the Xth follow-up.
Let's say that p looks like:
a_v2 b_v2_v3 a_v2_v3_v4 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
Data
p <- structure(list(dia_maxrd_v2 = c(0, 1), hear_sev_v2_v3 = c(1, 1), reop_ind_v2_v3_v4___1 = c(1,
0), neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1,
1), symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0,
0)), class = "data.frame", row.names = c(NA, -2L))
EDIT
To complicate things, some column names end with "___1", indicating a specific parameter relating to that clinical parameter, and this ending should be preserved, e.g. _v1_v2_v3_v4___1. Hence, this is still to be considered as fu4, and the ___1 part must not be dropped.
a_v2 b_v2_v3 a_v2_v3_v4___1 b_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 a_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
1 0 1 1 1 0
2 1 1 0 1 0
EDIT
My apologies, the solution must also preserve the "basic" column name, which specifies what parameter the column contains, e.g. post-surgical complications. Only the _v1_v2_v3..._vX part should be replaced with the corresponding fuX. Whatever comes before and after the _v1_v2_v3..._vX part must be preserved.
Consider
dia_maxrd_v2 hear_sev_v2_v3 reop_ind_v2_v3_v4___1 neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
1 0 1 1 1 0
2 1 1 0 1 0
Expected output:
> p
dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20 symp_pre_lokal_fu20
1 0 1 1 1 0
2 1 1 0 1 0
Upvotes: 1
Views: 72
Reputation: 173793
You can use gsub with two capturing groups:
names(p) <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", names(p))
p
#> a_fu2 b_fu3 a_fu4 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
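To see what the two capturing groups keep, here is a minimal sketch applying the same pattern to a few longer names in the style of the question's data. The vector nms and the result name first_try are made up for illustration, and perl = TRUE is my addition so that the lazy .*? follows PCRE rules. Because ^(.) captures only the first character, anything else in the base name is dropped, so this pattern suits single-letter prefixes like the toy example.

```r
# Illustrative names only (hypothetical vector, modelled on the question's data)
nms <- c("dia_maxrd_v2", "hear_sev_v2_v3", "neuro_def_v1_v2_v3")

# Group 1: the first character of the name.
# Group 2: the run of digits at the very end of the name.
# Everything in between is matched by the lazy .*? and discarded.
first_try <- gsub("^(.).*?(\\d+)$", "\\1_fu\\2", nms, perl = TRUE)

print(first_try)
#> [1] "d_fu2" "h_fu3" "n_fu3"
```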
EDIT
With the new requirements from the OP (usable in a pipe, and handling endings that were not in the original question):
library(dplyr)  # provides the %>% pipe
p %>% setNames(gsub("^(.).*?(\\d+_*\\d*)$", "\\1_fu\\2", names(.)))
#> a_fu2 b_fu3 a_fu4___1 b_fu20 a_fu20
#> 1 0 1 1 1 0
#> 2 1 1 0 1 0
EDIT
For arbitrary starting strings, it may be easiest to gsub twice:
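# Step 1: drop every "<digits>_v" chunk, so "_v1_v2_v3" collapses to "_v3"
p %>% setNames(gsub("(\\d{1,2}_v)+", "", names(.))) %>%
  # Step 2: rename the remaining "_vX" to "_fuX"
  setNames(gsub("_v(\\d+)", "_fu\\1", names(.)))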
#> dia_maxrd_fu2 hear_sev_fu3 reop_ind_fu4___1 neuro_def_fu20
#> 1 0 1 1 1
#> 2 1 1 0 1
#> symp_pre_lokal_fu20
#> 1 0
#> 2 0
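Since the question also asked for a function, here is a hedged sketch of a single-pass alternative to the two gsub calls. The helper name shorten_followup is hypothetical, and the pattern is my own variation rather than part of the answer above: it matches the whole run of _vN chunks at once and keeps only the last number, leaving whatever comes before and after untouched. It assumes the base name itself never contains "_v" followed by digits, since only the first such run is replaced.

```r
# Hypothetical helper (not from the original answer): collapse a run such as
# "_v1_v2_v3" into "_fu3" and leave the rest of the name as it is.
shorten_followup <- function(x) {
  # (?:_v\\d+)*  any number of leading "_vN" chunks (not captured)
  # _v(\\d+)     the last chunk of the run; its number is captured as \\1
  sub("(?:_v\\d+)*_v(\\d+)", "_fu\\1", x, perl = TRUE)
}

# Small stand-in for the question's data, using the same column names
p <- data.frame(
  dia_maxrd_v2 = c(0, 1),
  hear_sev_v2_v3 = c(1, 1),
  reop_ind_v2_v3_v4___1 = c(1, 0),
  neuro_def_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(1, 1),
  symp_pre_lokal_v1_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(0, 0)
)

names(p) <- shorten_followup(names(p))
print(names(p))
#> [1] "dia_maxrd_fu2"       "hear_sev_fu3"        "reop_ind_fu4___1"
#> [4] "neuro_def_fu20"      "symp_pre_lokal_fu20"
```

With dplyr 1.0.0 or later loaded, the same helper should also work in a pipe as p %>% rename_with(shorten_followup).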
Upvotes: 2