Reputation: 77
I am importing a series of surveys as .csv files and combining them into one data set. The problem is that in one of the seven files some of the variable names are imported slightly differently. The data set is huge, so I would like to write a function that I can run over the data set that is giving me trouble.
In some of the variable names there is an underscore where there should be a dot. Not all variable names follow the same format, but the incorrect ones do: the underscore is always the 6th character of the column name.
I want R to look at the 6th character and, if it is an underscore, replace it with a dot. Here is a made-up example:
col_names <- c("s1.help_needed",
"s1.Q2_im_stuck",
"s1.Q2.im_stuck",
"s1.Q3.regex",
"s1.Q3_regex",
"s2.Q1.is_confusing",
"s2.Q2.answer_please",
"s2.Q2_answer_please",
"s2.someone_knows_the answer",
"s3.appreciate_the_help")
I assume there is a regex answer to this, but I am struggling to find one. Perhaps there is also a tidyr answer?
Upvotes: 2
Views: 1482
Reputation: 521259
As @thelatemail pointed out, none of your data actually has an underscore in the fifth position, but some names have one in the sixth position (where others have a dot). A base R approach would be to use gsub():
result <- gsub("^(.{5})_", "\\1.", col_names)
> result
[1] "s1.help_needed" "s1.Q2.im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3.regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2.answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
Here is an explanation of the regex:
^ from the start of the string
(.{5}) match AND capture any five characters
_ followed by an underscore
The quantity in parentheses is called a capture group and can be used in the replacement via \\1. So the regex is saying: replace the first six characters with the five characters we captured, but use a dot as the sixth character.
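Since the goal is to run this over the column names of one problematic data set, here is a minimal sketch of how the same pattern could be wrapped in a function and applied to names(). The function name fix_sixth_underscore and the data frame survey are made up for illustration and are not from the question.

```r
# Hypothetical helper: if the 6th character of a name is an underscore,
# replace it with a dot; all other names are returned unchanged.
fix_sixth_underscore <- function(x) {
  sub("^(.{5})_", "\\1.", x)
}

# Made-up data frame standing in for the file that imported incorrectly.
survey <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(survey) <- c("s1.Q2_im_stuck", "s1.Q3.regex", "s3.appreciate_the_help")

# Rename the columns in place.
names(survey) <- fix_sixth_underscore(names(survey))
names(survey)
```

Only the first name changes here, because it is the only one with an underscore as its sixth character.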
Upvotes: 6
Reputation: 263352
You can use a capture group that matches the first 5 characters of any sort, followed by an underscore, and replace the match with whatever those 5 characters were, followed by a dot. Since all the examples had the underscore in the 6th position, I'm guessing you were not counting the original dots:
> col_names
[1] "s1.help_needed" "s1.Q2_im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3_regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2_answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
> sub("^(.....)_", "\\1.", col_names)
[1] "s1.help_needed" "s1.Q2.im_stuck"
[3] "s1.Q2.im_stuck" "s1.Q3.regex"
[5] "s1.Q3.regex" "s2.Q1.is_confusing"
[7] "s2.Q2.answer_please" "s2.Q2.answer_please"
[9] "s2.someone_knows_the answer" "s3.appreciate_the_help"
The backslash in the replacement argument still has to be doubled: in an R string literal, "\\1" is the two characters \1, which sub() reads as a reference to the first capture group, whereas a single "\1" would be parsed as an octal escape for a control character.
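If you would rather avoid regex altogether, the rule "look at the 6th character and swap it if it is an underscore" can also be written with substr(). This is only a sketch on a shortened version of the example vector, and the names needs_fix and fixed are made up for illustration.

```r
col_names <- c("s1.help_needed", "s1.Q2_im_stuck",
               "s1.Q3_regex", "s2.Q2.answer_please")

# TRUE where the 6th character is an underscore.
needs_fix <- substr(col_names, 6, 6) == "_"

fixed <- col_names
# Overwrite only the 6th character of the flagged names.
substr(fixed[needs_fix], 6, 6) <- "."
fixed
```

The logical vector needs_fix is also a convenient way to check which names will be changed before overwriting anything.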
Upvotes: 4