Bex

Reputation: 201

Perl regex for specific string

I want to split a string into different columns. Each of the lines appears as the one below.

TR10052|c9_g13_i6_DESeqResultsBacterialen=248   gi|497816164|ref|WP_010130320.1|        97.56   82      2       0       1       246     9       90      7e-51     167

I can split by white space, tabs, and "|" but I'm having trouble splitting the rest of the first section "TR10052|c9_g13_i6_DESeqResultsBacterialen=248" by a specific match of characters. I want the first column to be the TR##### piece, the second column to be the c#_g#_i# piece and the third column to be the rest of it starting with "_DESeq..." etc.

while ( my $line = <RESULTS> ) {
    chomp $line;
    my @column       = split( /[\t|] /_DES.*/ /, $line );
    my $transcriptID = $column[0];
    my $isoform      = $column[1];
    my $deseq        = $column[2];
}

Upvotes: 1

Views: 62

Answers (3)

Borodin

Reputation: 126752

It's easy to overuse split. In this case I think it's better to extract the fields you want with a custom regex pattern that captures each piece directly.

Like this:

use strict;
use warnings;

while ( <DATA> ) {
  my ($transcript_id, $isoform, $deseq) = /^ ([^|]+) \| (c\d+_g\d+_i\d+) _ (\S+)/x;
  print $_, "\n" for $transcript_id, $isoform, $deseq;
}

__DATA__
TR10052|c9_g13_i6_DESeqResultsBacterialen=248   gi|497816164|ref|WP_010130320.1|        97.56   82      2       0       1       246     9       90      7e-51     167

Output:

TR10052
c9_g13_i6
DESeqResultsBacterialen=248
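
Note that the pattern above drops the underscore before "DESeq", whereas the question asks for the third column to start with "_DESeq". Here is a minimal sketch of a variant that keeps the underscore and returns nothing for lines that do not match. The helper name parse_first_field is hypothetical, and the sample line is shortened for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (transcript ID, isoform, rest of first column),
# or an empty list when the line does not have the expected shape.
sub parse_first_field {
    my ($line) = @_;
    # Same pattern as above, except the underscore is inside the third capture.
    return $line =~ /^ ([^|]+) \| (c\d+_g\d+_i\d+) (_\S+)/x;
}

my $sample = "TR10052|c9_g13_i6_DESeqResultsBacterialen=248\tgi|497816164|ref|WP_010130320.1|";

# The list assignment is false when the match fails, so bad lines are skipped.
if ( my ($transcript_id, $isoform, $deseq) = parse_first_field($sample) ) {
    print "$_\n" for $transcript_id, $isoform, $deseq;
}
```

For the sample line this prints TR10052, c9_g13_i6 and _DESeqResultsBacterialen=248 on separate lines.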

Upvotes: 1

Bohemian

Reputation: 425298

Use a negative lookahead to split on underscores that are not followed by a lowercase letter and then a digit.

Try splitting on this regex:

/\||\_(?![a-z]\d)|\s+/

See live regex demo matching the desired characters on which to split.
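
A rough sketch of how this might look in Perl, assuming the columns are separated by whitespace. It applies the lookahead split to the first column only, because on the whole line the same regex would also split "WP_010130320.1" at its underscore (a digit, not a lowercase letter, follows it). The sample line is shortened for illustration:

```perl
use strict;
use warnings;

my $line = "TR10052|c9_g13_i6_DESeqResultsBacterialen=248\tgi|497816164|ref|WP_010130320.1|\t97.56";

# Isolate the first whitespace-delimited column.
my ($first) = split /\s+/, $line;

# Split on "|" or on an underscore NOT followed by lowercase letter + digit.
# The underscores in c9_g13_i6 are followed by "g1" and "i6", so they survive;
# the one before "DESeq" is followed by an uppercase letter, so it is a split point.
my @parts = split /\||_(?![a-z]\d)/, $first;

print "$_\n" for @parts;
```

This prints TR10052, c9_g13_i6 and DESeqResultsBacterialen=248. The underscore before "DESeq" is consumed by the split, so prepend it again if the third column needs to start with "_DESeq".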

Upvotes: 3

Danalog

Reputation: 559

Two splits might make it easier for you:

my ($transcriptID, $rest) = split(/\|/, $line, 2);
my ($isoform, $deseq) = split (/_DESeq/, $rest, 2);
$deseq = "_DESeq$deseq";

Transforms:

"TR10052|c9_g13_i6_DESeqResultsBacterialen=248 gi|497816164|ref|WP_010130320.1| 97.56 82 2 0 1 246 9 90 7e-51 167"

Into:

"TR10052", "c9_g13_i6", "_DESeqResultsBacterialen=248 gi|497816164|ref|WP_010130320.1| 97.56 82 2 0 1 246 9 90 7e-51 167"

Is that what you're looking for?
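
If the third piece should stop at the end of the first column rather than run to the end of the line, one possible variation (a sketch, assuming the columns are separated by whitespace) is to peel off the first column before doing the two splits. The second split here uses a lookahead, so "_DESeq" stays attached and does not need to be added back. The sample line is shortened for illustration:

```perl
use strict;
use warnings;

my $line = "TR10052|c9_g13_i6_DESeqResultsBacterialen=248\tgi|497816164|ref|WP_010130320.1|\t97.56\t82";

# First column on its own; the remaining columns are kept for later use.
my ($first, @others) = split /\s+/, $line;

# Split the first column at the first "|".
my ($transcriptID, $rest) = split /\|/, $first, 2;

# Zero-width split just before "_DESeq": nothing is consumed.
my ($isoform, $deseq) = split /(?=_DESeq)/, $rest, 2;

print "$_\n" for $transcriptID, $isoform, $deseq;
```

This prints TR10052, c9_g13_i6 and _DESeqResultsBacterialen=248.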

Upvotes: 2
