Reputation: 63
I have a text file that contains IDs that I need to retain. The file also contains a lot of other data that I need to remove, but it is not in a delimited or fixed-width format. Is there a way to use the find/replace function in Notepad++ to remove everything except the IDs? The ID numbers themselves start with GO (GO:000382, for example). I've tried to implement the advice here without success, though I'm not sure that I'm implementing it correctly. I'm using the replace function with
find = ^.*GO ([0-9] +).*$ and replace = \1.
Any help would be most appreciated.
The data looks like this
GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]
Query matches synonym "Pat enzyme" [exact synonym]
molecular function
8821 gene products
view in tree
GO:0019899 : enzyme binding [show def] molecular function
240 gene products
view in tree
GO:0000307 : cyclin-dependent protein kinase holoenzyme complex [show def]
Query matches synonym "CDK holoenzyme" [exact synonym]
What I'd like back would be:
GO:0043894
GO:0019899
GO:0000307
Upvotes: 1
Views: 1893
Reputation: 89557
You can use this:
search: (?:(\r?\n?|^)(GO:\d{7}).*|(?:\r?\n|^).*)
replace: $1$2
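For anyone who wants to try this pattern outside Notepad++, here is a minimal Python sketch. It assumes Python 3.5 or later (where an unmatched group in a replacement becomes an empty string), adds re.MULTILINE so that ^ matches at each line start as it does in Notepad++, and writes the replacement as \1\2 because Python's re module does not use the $1 syntax. The function name keep_go_lines and the shortened sample text are illustrative only.

```python
import re

# Search pattern copied from the answer; only the flags and the
# replacement syntax are adapted for Python (an assumption of this sketch).
PATTERN = re.compile(r"(?:(\r?\n?|^)(GO:\d{7}).*|(?:\r?\n|^).*)", re.MULTILINE)


def keep_go_lines(text):
    """Reduce each GO line to its ID and delete every other line."""
    # Group 1 keeps the line break in front of an ID line, group 2 is the ID.
    # Lines without an ID match the second alternative and are removed.
    return PATTERN.sub(r"\1\2", text)


sample = (
    "GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]\n"
    "molecular function\n"
    "8821 gene products\n"
    "GO:0019899 : enzyme binding [show def] molecular function\n"
    "view in tree\n"
    "GO:0000307 : cyclin-dependent protein kinase holoenzyme complex [show def]"
)

print(keep_go_lines(sample))
```

Note that \d{7} only matches IDs with exactly seven digits after GO:, which fits the sample data in the question.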
Upvotes: 0
Reputation: 43673
Use \G.*?(GO:\d+|$) as a global dot-all regex pattern and $1\n for the replacement.
See demo here.
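Python's built-in re module does not support the \G anchor, so this pattern cannot be passed to re.sub as written. As a swapped-in alternative (not the method of this answer), the sketch below collects the IDs directly instead of deleting the text around them. The function name extract_go_ids and the sample text are hypothetical.

```python
import re


def extract_go_ids(text):
    """Return every GO:<digits> token in order of appearance."""
    # findall scans the whole text, so line breaks need no special handling.
    return re.findall(r"GO:\d+", text)


sample = (
    "GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]\n"
    "Query matches synonym \"Pat enzyme\" [exact synonym]\n"
    "GO:0019899 : enzyme binding [show def] molecular function\n"
    "240 gene products\n"
)

# One ID per line, matching the output format asked for in the question.
print("\n".join(extract_go_ids(sample)))
```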
Upvotes: 1
Reputation: 70149
Search for:
(?:[^G]|G(?!O:\d))*(GO:\d+)?
Replace with:
\1\n
I'm adding a line break between the IDs so they don't appear concatenated. Feel free to use another separator.
Explanation:
(?:[^G]|G(?!O:\d))* matches any non-G characters, or a G not followed by O: and a digit, zero or more times. These will be erased.
(GO:\d+) matches GO: followed by 1 or more digits and saves the ID into a capturing group.
The (GO:\d+) group is optional (the trailing ?) so that this expression easily removes the text after the last match (the first part will match the remaining text).
Upvotes: 1
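The search/replace pair from the last answer can also be tried in Python. This is a minimal sketch, assuming Python 3.5 or later so that the unmatched optional group is replaced by an empty string. The blank-line filtering at the end is an addition of this sketch, since the tail match (and any empty match at the end of the text) leaves bare line breaks behind. The function name strip_to_go_ids is hypothetical.

```python
import re

# Pattern copied from the answer: skip everything that is not the start
# of an ID, then optionally capture the ID itself.
PATTERN = re.compile(r"(?:[^G]|G(?!O:\d))*(GO:\d+)?")


def strip_to_go_ids(text):
    """Apply the answer's replacement and return the surviving IDs."""
    replaced = PATTERN.sub(r"\1\n", text)
    # Drop the empty lines produced by matches that captured no ID.
    return [line for line in replaced.splitlines() if line]


sample = (
    "GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]\n"
    "8821 gene products\n"
    "GO:0019899 : enzyme binding [show def] molecular function\n"
    "view in tree"
)

print("\n".join(strip_to_go_ids(sample)))
```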