Nisba

Reputation: 3448

split by separator inside pieces split by another separator

This is an excerpt of an example .gtf file. I need to split each line on the \t separator and then split only the last resulting element on ;.

X   Ensembl Repeat  2419108 2419128 42  .   .   hid=trf; hstart=1; hend=21
X   Ensembl Repeat  2419108 2419410 2502    -   .   hid=AluSx; hstart=1; hend=303
X   Ensembl Repeat  2419108 2419128 0   .   .   hid=dust; hstart=2419108; hend=2419128
X   Ensembl Pred.trans. 2416676 2418760 450.19  -   2   genscan=GENSCAN00000019335
X   Ensembl Variation   2413425 2413425 .   +   .   
X   Ensembl Variation   2413805 2413805 .   +   .

I was almost able to split on \t (I have problems at the ends of the lines) using this regex: (?:21$)|(?:\t*(.*?[^\t]+)). I also tried to split the last element using an if-else conditional and negative lookarounds, but without success.

How can I do it?

Related question: RegEx: Split string by separator and then by another

Upvotes: 0

Views: 46

Answers (2)

zzxyz

Reputation: 2981

OP commented that Python was being used, but that other languages are OK. I'm not sure how much, if any, of this applies to Python, but I more or less agree with the comment that trying to do this with a single regex is silly. For example, here's Perl that does it with two splits:

perl -F"\t" -lane 'for $i (0..$#F){if ($i!=$#F) {print "$F[$i]"} else {print for split(/;\s?/, $F[$i])}}' input

To break this down, -F"\t" splits each line on tabs into the @F array. I then loop through it and split only the last element on semicolons. This is okay as a one-liner, but barely; trying to do much more with the output would start to get ridiculous.
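
Since the OP is using Python, the same two-split idea can be sketched there as well. This is only a rough equivalent of the one-liner above, not a translation of it; the function name split_gtf_line is made up for illustration, and it assumes the last tab-separated field is always the one to split on semicolons.

```python
def split_gtf_line(line):
    """Split on tabs, then split only the last field on semicolons."""
    fields = line.rstrip("\n").split("\t")
    # Split the last field on ';' and strip the whitespace around each piece.
    last = [part.strip() for part in fields[-1].split(";")]
    return fields[:-1] + last


line = "X\tEnsembl\tRepeat\t2419108\t2419128\t42\t.\t.\thid=trf; hstart=1; hend=21"
print(split_gtf_line(line))
```

A line whose last field is empty (a trailing tab) keeps that empty string as its final element, which matches what a plain split on tabs would give.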

But then I saw @ctwheels's answer (here's the Perl equivalent):

perl -F'/\t|;[^\S\t]*(?=[^\t]*$)/' -lane 'print for @F' input

This is awesome. The input is already split before the program even starts, so the "program" (print for @F) merely prints the results, meaning that if I had other work to do, I could easily do it. And truthfully, I only had to stare at it for a couple of minutes before it stopped hurting my brain. It is possibly easier to understand than the "code" answer, and it should be portable to most languages with PCRE-style regexes.

Upvotes: 1

ctwheels

Reputation: 22817

Either of the regexes below can be used as the split pattern. The second one also removes the whitespace that follows each semicolon in the last element.

\t|;(?=[^\t]*$)
\t|;[^\S\t]*(?=[^\t]*$)

Match either of the following:

  • Option 1
    • \t Matches the tab character
  • Option 2
    • ; Match this literally
    • [^\S\t]* Matches any number of whitespace characters other than \t. This is the part of the second regex that removes the whitespace after each semicolon.
    • (?=[^\t]*$) Positive lookahead ensuring what follows matches the following
      • [^\t]* Matches any character except \t any number of times
      • $ Assert position at the end of the line

I realize this is likely a file so you'd open the file and then run this over each line, but I just took the sample you put in your question and split the string using splitlines() to mimic that behaviour.

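import re

# Sample lines from the question; the fields are tab-separated (written as \t here).
d = """X\tEnsembl\tRepeat\t2419108\t2419128\t42\t.\t.\thid=trf; hstart=1; hend=21
X\tEnsembl\tRepeat\t2419108\t2419410\t2502\t-\t.\thid=AluSx; hstart=1; hend=303
X\tEnsembl\tRepeat\t2419108\t2419128\t0\t.\t.\thid=dust; hstart=2419108; hend=2419128
X\tEnsembl\tPred.trans.\t2416676\t2418760\t450.19\t-\t2\tgenscan=GENSCAN00000019335
X\tEnsembl\tVariation\t2413425\t2413425\t.\t+\t.\t
X\tEnsembl\tVariation\t2413805\t2413805\t.\t+\t."""

# Split on every tab, and on each ';' (plus any non-tab whitespace after it)
# that is not followed by another tab, i.e. that lies in the last field.
print([re.split(r"\t|;[^\S\t]*(?=[^\t]*$)", e) for e in d.splitlines()])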

Result:

[
    ['X', 'Ensembl', 'Repeat', '2419108', '2419128', '42', '.', '.', 'hid=trf', 'hstart=1', 'hend=21'],
    ['X', 'Ensembl', 'Repeat', '2419108', '2419410', '2502', '-', '.', 'hid=AluSx', 'hstart=1', 'hend=303'],
    ['X', 'Ensembl', 'Repeat', '2419108', '2419128', '0', '.', '.', 'hid=dust', 'hstart=2419108', 'hend=2419128'],
    ['X', 'Ensembl', 'Pred.trans.', '2416676', '2418760', '450.19', '-', '2', 'genscan=GENSCAN00000019335'],
    ['X', 'Ensembl', 'Variation', '2413425', '2413425', '.', '+', '.', ''],
    ['X', 'Ensembl', 'Variation', '2413805', '2413805', '.', '+', '.']
]
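
For the case mentioned above where the data lives in a file, one possible shape is a small generator that applies the same pattern to each line of an open file object. This is a sketch under assumptions: the helper name split_gtf_lines is hypothetical, blank lines are skipped as a design choice, and io.StringIO stands in for a real file handle so the example runs on its own.

```python
import io
import re

# Same pattern as above, compiled once and reused for every line.
SPLIT_RE = re.compile(r"\t|;[^\S\t]*(?=[^\t]*$)")


def split_gtf_lines(handle):
    """Yield the list of split fields for each non-empty line of an open file."""
    for raw in handle:
        line = raw.rstrip("\n")
        if line:
            yield SPLIT_RE.split(line)


# io.StringIO stands in for an opened .gtf file here.
sample = io.StringIO(
    "X\tEnsembl\tPred.trans.\t2416676\t2418760\t450.19\t-\t2\tgenscan=GENSCAN00000019335\n"
    "X\tEnsembl\tVariation\t2413805\t2413805\t.\t+\t.\n"
)
for fields in split_gtf_lines(sample):
    print(fields)
```

With a real file, the handle would come from a with open(...) block instead, and the generator keeps only one line in memory at a time.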

Upvotes: 2
