Reputation: 25
I am using R to process a data.frame; one column contains a mixture of letters and numbers, and I want to insert a comma between two occurrences of a pattern:
Input:
arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat
arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat
Desired output:
arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat
arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat
Basically, I want to put a comma after the first (xxx-xxx)x1 (this could be x1, x2 or x3, and it may be followed by "mat" or "pat").
Many thanks to MichaelChirico and Onyambu. I have extracted more values from that column:
Input:
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
Desired output:
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat',
'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3, Xq21.31q28(90731177-155208244)x1 ish',
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x3,10q26.12(121812494-122486677)x1'
I am trying to use the following code, but it does not work for all of these cases:
x <- c(
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
)
sub(pattern = '([)]x[1|2|3|1~2|1~3]\s[mat|pat|dn]?))', replacement = '\1,', x=x)
Upvotes: 0
Views: 90
Reputation: 826
You can do the following:
x <- c(
'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
sub(pattern = "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)", replacement = "\\1,\\2", x = x)
Here is a brief explanation:
1) The regexp for matching (xxx-xxx)x1 is [(][0-9]+-[0-9]+[)]x[0-9]. Here I used [] instead of escaping to match ( and ). The rest reads as: one or more digits [0-9]+, followed by -, followed by one or more digits [0-9]+, followed by ), x and a single digit [0-9].
2) Capturing groups split the string so that it can be put back together afterwards. The second group ([^[:space:]].*) matches a non-whitespace character followed by any number of characters, so the pattern from 1) ends up in the first group and the rest of the string in the second. The replacement "\\1,\\2" then joins the two groups with a comma between them.
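To see what each capturing group picks up, the groups can be inspected directly. The following is a minimal sketch using base R's regexec() and regmatches() on the same x; the names rx and m are only illustrative.

```r
x <- c(
  'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
  'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
  'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
rx <- "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)"

# regexec() records the full match plus each capturing group;
# regmatches() extracts them as one character vector per input string
m <- regmatches(x, regexec(rx, x))

m[[1]][2]       # group 1: "(2097357-2432381)x3"
m[[1]][3]       # group 2: the rest of the string, starting at "11p15.4"
length(m[[3]])  # 0, i.e. no match for the third string
```

The third string produces no match because every (xxx-xxx)x1 in it is followed by a space, so sub() leaves that string unchanged.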
Upvotes: 0
Reputation: 79238
sub("(\\).*?)(\\d{2}[a-z])","\\1,\\2",x)
[1] "arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3"
[2] "arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat"
[3] "arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat"
Upvotes: 1
Reputation: 34703
You said: "I want to put a comma after the first (xxx-xxx)x1". But your third case contradicts this, since there the comma goes after "mat" rather than directly after x1. Until you clarify your rule for substitution, you can try, for your vector of strings x,
sub('([(][0-9]{7}-[0-9]{7}[)]x[0-9])', '\\1,', x)
You might also want to replace [0-9] with \\d, which is slightly more robust to locale:
sub('([(]\\d{7}-\\d{7}[)]x\\d)', '\\1,', x)
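As a quick check, the two spellings can be compared on the three example strings. This is only a sketch; res_class and res_digit are illustrative names, and x is the three-element vector from the question.

```r
x <- c(
  'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
  'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
  'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)

res_class <- sub('([(][0-9]{7}-[0-9]{7}[)]x[0-9])', '\\1,', x)
res_digit <- sub('([(]\\d{7}-\\d{7}[)]x\\d)', '\\1,', x)

# Both patterns give the same result on this ASCII input
identical(res_class, res_digit)

# In the third string the comma lands directly after x1, before " mat"
res_digit[3]
# "arr 11p15.5(2097357-2432381)x1, mat13q15.4(3224902-3483881)x1 pat"
```

The first two strings come out as desired; only the third differs from the requested output.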
To accommodate the mat in the third example, you might try:
sub('([(]\\d{7}-\\d{7}[)]x\\d(\\smat)?)', '\\1,', x)
But this is highly custom-tailored to fit exactly your example.
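For the six strings in the updated question, one possible generalisation is sketched below. It rests on assumptions that are not confirmed by the asker: the copy number is a single digit, optionally followed by ~ and another digit; it may be followed by " mat", " pat" or " dn"; and the next segment starts with optional whitespace, then a chromosome (one or two digits, X or Y), then p or q. It uses perl = TRUE so that a lookahead can test for the next segment without consuming it. The names rx, res and expected are illustrative.

```r
x <- c(
  'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
  'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
  'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
  'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
  'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
  'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
)

rx <- paste0(
  "([)]x[0-9](?:~[0-9])?",          # ")x1", optionally "~3" (assumed format)
  "(?: (?:mat|pat|dn))?)",          # optional " mat", " pat" or " dn"
  "(?=\\s?(?:[0-9]{1,2}|[XY])[pq])" # only if a new segment such as "11p" or " Xq" follows
)

# sub() replaces the first match only; strings without a match are returned unchanged
res <- sub(rx, "\\1,", x, perl = TRUE)

expected <- c(
  'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
  'arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat',
  'arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat',
  'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3, Xq21.31q28(90731177-155208244)x1 ish',
  'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
  'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x3,10q26.12(121812494-122486677)x1'
)
identical(res, expected)
```

Because the lookahead requires a following segment, the fifth string, which already contains the comma, is left as it is. Values outside these six examples may need further adjustments to the pattern.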
Upvotes: 0