Reputation: 3448
This is an excerpt of an example .gtf file. I need to split each line by the \t separator and then split only the last element I obtained by ;.
X Ensembl Repeat 2419108 2419128 42 . . hid=trf; hstart=1; hend=21
X Ensembl Repeat 2419108 2419410 2502 - . hid=AluSx; hstart=1; hend=303
X Ensembl Repeat 2419108 2419128 0 . . hid=dust; hstart=2419108; hend=2419128
X Ensembl Pred.trans. 2416676 2418760 450.19 - 2 genscan=GENSCAN00000019335
X Ensembl Variation 2413425 2413425 . + .
X Ensembl Variation 2413805 2413805 . + .
I was almost able to split by \t (I have problems with the ends of the lines) using this regex: (?:21$)|(?:\t*(.*?[^\t]+)) (try it here).
I also tried to split the last element with if/else and negative lookaround, but without results.
How can I do it?
Related question: RegEx: Split string by separator and then by another
Upvotes: 0
Views: 46
Reputation: 2981
OP commented that Python was being used, but that other languages are OK. I'm not sure how much, if any, of this applies to Python, but I more or less agree with the comment that trying to do this with a single regex is silly. For example, here's Perl that does it with two splits:
perl -F"\t" -lane 'for $i (0..$#F){if ($i!=$#F) {print "$F[$i]"} else {print for split(/;\s?/, $F[$i])}}' input
To break this down, -F"\t" splits each line on tabs into the @F array. Then I loop through it and split the last element on semicolons. This is okay as a one-liner, but barely. Trying to do much more with the output of this would start to get ridiculous.
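Since the OP mentioned Python, the same two-split idea could be sketched roughly as below. This is only an illustration, not a tested port of the one-liner: the function name split_gtf_line is made up, and the sample line assumes real tab characters between the columns.

```python
import re

def split_gtf_line(line):
    """Split on tabs, then split only the last field on ';' plus one optional whitespace."""
    fields = line.rstrip("\r\n").split("\t")
    # Mirrors the Perl split(/;\s?/, ...) applied to the final element only.
    return fields[:-1] + re.split(r";\s?", fields[-1])

# Sample line from the question, with the tabs written as \t escapes.
line = "X\tEnsembl\tRepeat\t2419108\t2419128\t42\t.\t.\thid=trf; hstart=1; hend=21"
print(split_gtf_line(line))
```

One difference worth keeping in mind: Perl's split drops trailing empty fields, while Python's str.split and re.split keep them.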
But then I saw @ctwheels answer (here's the Perl equivalent):
perl -F'/\t|;[^\S\t]*(?=[^\t]*$)/' -lane 'print for @F' input
This is awesome. The input's already split up and everything is done before I even start. The "program" (print for @F) is merely printing the results, meaning if I had other work to do, I could easily do it. And truthfully, I only had to stare at it for a couple of minutes before it stopped hurting my brain. Possibly easier to understand than the "code" answer and basically portable between any PCRE-type language.
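To see what the lookahead in that pattern buys you, here is a small Python sketch. The input line is hypothetical (it puts a ; inside an earlier column, which the sample data never does), and the "naive" pattern is just a strawman for comparison.

```python
import re

# Hypothetical line: the third column contains a ';' that should not be split.
line = "X\tEnsembl\tRe;peat\t42\thid=trf; hstart=1; hend=21"

# Strawman: split on every tab and every semicolon.
naive = re.split(r"\t|;\s*", line)

# Pattern from the answer: a ';' only counts if no tab follows it before end of line,
# i.e. only semicolons inside the last tab-separated field are split points.
guarded = re.split(r"\t|;[^\S\t]*(?=[^\t]*$)", line)

print(naive)
print(guarded)
```

The naive version breaks "Re;peat" into two fields, while the guarded version leaves it intact and still splits the last field into its three parts.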
Upvotes: 1
Reputation: 22817
See regex in use here. The second regex cleans whitespace from that element as seen here.
\t|;(?=[^\t]*$)
\t|;[^\S\t]*(?=[^\t]*$)
Match either of the following:
  \t : matches the tab character
  ;[^\S\t]*(?=[^\t]*$) : matches the following
    ; : matches this character literally
    [^\S\t]* : matches any number of whitespace characters except \t. This is what cleans up the whitespace in the second regex.
    (?=[^\t]*$) : positive lookahead ensuring that what follows matches
      [^\t]* : any character except \t, any number of times
      $ : asserts position at the end of the line

I realize this is likely a file so you'd open the file and then run this over each line, but I just took the sample you put in your question and split the string using splitlines() to mimic that behaviour.
import re

# Columns are tab-separated; the tabs are written as \t escapes so they survive copy and paste.
d = """X\tEnsembl\tRepeat\t2419108\t2419128\t42\t.\t.\thid=trf; hstart=1; hend=21
X\tEnsembl\tRepeat\t2419108\t2419410\t2502\t-\t.\thid=AluSx; hstart=1; hend=303
X\tEnsembl\tRepeat\t2419108\t2419128\t0\t.\t.\thid=dust; hstart=2419108; hend=2419128
X\tEnsembl\tPred.trans.\t2416676\t2418760\t450.19\t-\t2\tgenscan=GENSCAN00000019335
X\tEnsembl\tVariation\t2413425\t2413425\t.\t+\t.\t
X\tEnsembl\tVariation\t2413805\t2413805\t.\t+\t."""
print([re.split(r"\t|;[^\S\t]*(?=[^\t]*$)", e) for e in d.splitlines()])
Result:
[
['X', 'Ensembl', 'Repeat', '2419108', '2419128', '42', '.', '.', 'hid=trf', 'hstart=1', 'hend=21'],
['X', 'Ensembl', 'Repeat', '2419108', '2419410', '2502', '-', '.', 'hid=AluSx', 'hstart=1', 'hend=303'],
['X', 'Ensembl', 'Repeat', '2419108', '2419128', '0', '.', '.', 'hid=dust', 'hstart=2419108', 'hend=2419128'],
['X', 'Ensembl', 'Pred.trans.', '2416676', '2418760', '450.19', '-', '2', 'genscan=GENSCAN00000019335'],
['X', 'Ensembl', 'Variation', '2413425', '2413425', '.', '+', '.', ''],
['X', 'Ensembl', 'Variation', '2413805', '2413805', '.', '+', '.']
]
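For an actual file, the same pattern could be applied line by line, roughly as in the sketch below. The names SPLITTER and split_lines are made up for illustration, and io.StringIO stands in for a real open file so the snippet runs on its own.

```python
import io
import re

# Compile once and reuse for every line; same pattern as above.
SPLITTER = re.compile(r"\t|;[^\S\t]*(?=[^\t]*$)")

def split_lines(handle):
    """Yield a list of fields for each non-empty line of an open text file."""
    for raw in handle:
        line = raw.rstrip("\r\n")  # drop the line ending, keep any trailing tab
        if line:
            yield SPLITTER.split(line)

# io.StringIO stands in for something like open("input.gtf"); that file name is hypothetical.
sample = io.StringIO(
    "X\tEnsembl\tRepeat\t2419108\t2419128\t42\t.\t.\thid=trf; hstart=1; hend=21\n"
    "X\tEnsembl\tVariation\t2413425\t2413425\t.\t+\t.\n"
)
for fields in split_lines(sample):
    print(fields)
```

Because split_lines is a generator, it never holds the whole file in memory, which may matter for large annotation files.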
Upvotes: 2