puppetshow

Reputation: 25

How to put a comma between a pattern of characters?

I am using R to process a data.frame. One column contains a mixture of letters and numbers, and I want to insert a comma between certain patterns of characters:

Input:

 arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
 arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat
 arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat

Desired output:

 arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3
 arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat
 arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat

Basically, I want to put a comma after the first (xxx-xxx)x1. The suffix could be x1, x2, or x3, and it may be followed by "mat" or "pat".

Many thanks to MichaelChirico and Onyambu. I extracted more content from that column.

Input:

 'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
 'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
 'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
 'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
 'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
 'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'

Output:

 'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
 'arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat',
 'arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat',
 'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3, Xq21.31q28(90731177-155208244)x1 ish',
 'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
 'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x3,10q26.12(121812494-122486677)x1'

I am trying to use the following code, but it does not work for all of these cases:

x <- c(
    'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
    'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
    'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
    'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
    'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
    'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
)
sub(pattern = '([)]x[1|2|3|1~2|1~3]\s[mat|pat|dn]?))', replacement = '\1,', x = x)

Upvotes: 0

Views: 90

Answers (3)

Aleh

Reputation: 826

You can do the following:

x <- c(
    'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
    'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
    'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
sub(pattern = "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)", replacement = "\\1,\\2", x = x)

Here is a brief explanation:

1) The regexp for matching (xxx-xxx)x1 is [(][0-9]+-[0-9]+[)]x[0-9]. Here I used [] instead of escaping to match ( and ). The rest reads as: one or more digits [0-9]+, followed by -, followed by one or more digits [0-9]+, followed by ), x, and a single digit [0-9].

2) Capturing groups split the string so that the pieces can be concatenated later. The second group, ([^[:space:]].*), matches a non-whitespace character followed by any number of characters, so the pattern from 1) lands in the first group and the rest of the string in the second. The replacement "\\1,\\2" then joins the two groups with a comma between them.
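To see what the two groups actually capture, one option is to inspect them with regexec() and regmatches(). This is only a sketch using the three strings from the question; the names pattern, m, and out are chosen here for illustration. It also suggests that the third string is left unchanged, because there every (xxx-xxx)x1 is followed by a space, so the second group has nothing to start on.

```r
x <- c(
    'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
    'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
    'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
pattern <- "([(][0-9]+-[0-9]+[)]x[0-9])([^[:space:]].*)"

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(x, regexec(pattern, x))

m[[1]][2]       # group 1: "(2097357-2432381)x3"
m[[1]][3]       # group 2: the rest of the string, starting at "11p15.4"
length(m[[3]])  # 0, i.e. no match in the third string

# sub() returns non-matching elements untouched, so x[3] gets no comma.
out <- sub(pattern, "\\1,\\2", x)
```

If the "x1 mat" case has to be covered as well, the first group would need an optional suffix; that is a change to the rule rather than to the mechanics shown here.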

Upvotes: 0

Onyambu

Reputation: 79238

sub("(\\).*?)(\\d{2}[a-z])","\\1,\\2",x)
[1] "arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3"
[2] "arr 11p15.5(2097357-2432381)x2,11p15.4(3224902-4383881)x1 pat"                             
[3] "arr 11p15.5(2097357-2432381)x1 mat,13q15.4(3224902-3483881)x1 pat"
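This pattern does not spell out (xxx-xxx)x1. It starts at the first ), lets the lazy .*? consume as little as possible, and stops at the first run of two digits followed by a lowercase letter, which is taken to be the start of the next band (11p, 13q). A small sketch for inspecting the two groups, assuming x holds the three strings from the question; the names pattern, m, and groups are illustrative only:

```r
x <- c(
    'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
    'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
    'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat'
)
pattern <- "(\\).*?)(\\d{2}[a-z])"

# Each list element holds the full match followed by the two groups.
m <- regmatches(x, regexec(pattern, x))

# One row per string: column 1 precedes the comma, column 2 follows it.
groups <- t(sapply(m, function(g) g[2:3]))
groups
```

Because the second group only looks for two digits plus a letter, a following band that begins differently (for example one starting with X) is unlikely to be split at its first character.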

Upvotes: 1

MichaelChirico

Reputation: 34703

You said:

"I want to put a comma after the first (xxx-xxx)x1"

But your third case contradicts this. Until you clarify your rule for substitution, you can try, for your vector of strings x,

sub('([(][0-9]{7}-[0-9]{7}[)]x[0-9])', '\\1,', x)

Explore what's going on here.

You might also want to replace [0-9] with \\d, which is slightly more robust to locale:

sub('([(]\\d{7}-\\d{7}[)]x\\d)', '\\1,', x)

To accommodate the mat in the third case, you might try:

sub('([(]\\d{7}-\\d{7}[)]x\\d(\\smat)?)', '\\1,', x)

But this is highly custom-tailored to fit exactly your example.
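If the rule is taken to be "insert a comma after the first )xN, optionally followed by ~N and by mat, pat, or dn, whenever another band follows", the last pattern can be generalised. The following is only a sketch under that assumed rule, tried against the six extended strings from the question. The suffix list (mat, pat, dn) comes from the asker's own attempt, and the lookahead for a digit or X is a guess at how a band begins. It needs perl = TRUE because of the lookahead.

```r
x <- c(
    'arr 11p15.5(2097357-2432381)x311p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3',
    'arr 11p15.5(2097357-2432381)x211p15.4(3224902-4383881)x1 pat',
    'arr 11p15.5(2097357-2432381)x1 mat13q15.4(3224902-3483881)x1 pat',
    'arr[hg19] Xp22.33p22.12(60701-21536551)x1~3 Xq21.31q28(90731177-155208244)x1 ish',
    'arr 11p15.5(2097357-2432381)x3,11p15.4(3424982-4083881)x3 pat.nuc ish11p15.5(RP11-558K10x3)',
    'nuc ish(D21S259/D21S341/D21S342x3).arr(21)x310q26.12(121812494-122486677)x1'
)

# [)]x[0-9]         closing parenthesis and copy number, e.g. ")x3"
# (~[0-9])?         optional range such as "~3" in "x1~3"
# ( (mat|pat|dn))?  optional suffix (assumed list of tokens)
# (?= ?[0-9X])      match only if a new band follows (assumed: digit or X)
pattern <- "([)]x[0-9](~[0-9])?( (mat|pat|dn))?)(?= ?[0-9X])"

# The lookahead consumes nothing, so only a comma is added after group 1.
out <- sub(pattern, "\\1,", x, perl = TRUE)
out
```

The fifth string already contains the comma, so the lookahead fails at every candidate position and sub() leaves it alone. Inputs outside these six examples may well need further adjustments.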

Upvotes: 0
