user3000124

Reputation: 63

Remove all entries except specific words using Notepad++

I have a text file that contains IDs that I need to retain. The file also contains a lot of other data that I need to remove, but it is not in a delimited or fixed-width format. Is there a way to use the find/replace function in Notepad++ to remove everything except the IDs? The IDs all start with GO (GO:000382, for example). I've tried to implement advice I found, without success, though I'm not sure I'm applying it correctly. I'm using the replace function with

find = ^.*GO ([0-9] +).*$ and replace =  \1. 

Any help would be most appreciated.

The data looks like this:

GO:0043894  :   acetyl-CoA  synthetase  acetyltransferase   activity    [show   def]
Query   matches synonym "Pat    enzyme" [exact  synonym]

molecular   function

8821    gene    products
view    in  tree
GO:0019899  :   enzyme  binding [show   def]    molecular   function

240 gene    products
view    in  tree
GO:0000307  :   cyclin-dependent    protein kinase  holoenzyme  complex [show   def]
Query   matches synonym "CDK    holoenzyme" [exact  synonym]

What I'd like back is:

GO:0043894
GO:0019899
GO:0000307

Upvotes: 1

Views: 1893

Answers (3)

Casimir et Hippolyte

Reputation: 89557

You can use this:

search:  (?:(\r?\n?|^)(GO:\d{7}).*|(?:\r?\n|^).*)
replace: $1$2

Upvotes: 0

Ωmega

Reputation: 43673

Use \G.*?(GO:\d+|$) as a global, dot-all regex pattern (in Notepad++, tick ". matches newline" and use Replace All) and $1\n as the replacement.

See demo here.
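
For anyone who wants to check the expected result outside Notepad++, here is a minimal Python sketch. Python's re module has no \G anchor, so this does not reproduce the pattern above; it swaps in re.findall to collect the same capture (GO:\d+) directly. The function name extract_go_ids and the sample text are made up for illustration, with the sample modelled on the question's data.

```python
import re

# Sample text modelled on the question's data (spacing simplified).
sample = (
    "GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]\n"
    'Query matches synonym "Pat enzyme" [exact synonym]\n'
    "\n"
    "8821 gene products\n"
    "GO:0019899 : enzyme binding [show def] molecular function\n"
    "GO:0000307 : cyclin-dependent protein kinase holoenzyme complex [show def]\n"
)


def extract_go_ids(text):
    """Return every 'GO:' followed by one or more digits, in order of appearance."""
    return re.findall(r"GO:\d+", text)


ids = extract_go_ids(sample)
print("\n".join(ids))  # one ID per line
```

This keeps only the IDs and discards everything else, which is the same end result the replace-based approach aims for.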

Upvotes: 1

Fabrício Matté

Reputation: 70149

Search for:

(?:[^G]|G(?!O:\d))*(GO:\d+)?

Replace with:

\1\n

See demo

I'm adding a line break between the IDs so they don't appear concatenated. Feel free to use another separator.

Explanation:

  • The first part matches any non-G character, or a G that is not followed by O: and a digit, zero or more times. These characters will be erased.
  • The second part matches GO: followed by one or more digits and saves the whole ID (including the GO: prefix) in capturing group 1.
  • The GO:\d+ group is optional so that the expression also removes the text after the last ID (the first part matches the remaining text and the group stays empty).
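
As a rough way to try this pattern outside Notepad++, the sketch below applies it with Python's re.sub. It assumes Python 3.5 or later, where an unmatched group in the replacement becomes an empty string. The name keep_go_ids and the sample text are hypothetical, and Python's handling of empty matches may leave extra blank lines compared with Notepad++, so the sketch splits on whitespace to tidy the result.

```python
import re

# Same pattern as above: skip anything that is not the start of an ID,
# then optionally capture one ID in group 1.
PATTERN = re.compile(r"(?:[^G]|G(?!O:\d))*(GO:\d+)?")


def keep_go_ids(text):
    """Replace each match with the captured ID plus a newline, then tidy up."""
    replaced = PATTERN.sub(r"\1\n", text)
    # The trailing non-ID text and the final empty match contribute bare
    # newlines; split() drops them and returns just the IDs.
    return replaced.split()


# Sample text modelled on the question's data (spacing simplified).
sample = (
    "GO:0043894 : acetyl-CoA synthetase acetyltransferase activity [show def]\n"
    'Query matches synonym "Pat enzyme" [exact synonym]\n'
    "GO:0019899 : enzyme binding [show def] molecular function\n"
    "240 gene products\n"
)

result = keep_go_ids(sample)
print("\n".join(result))
```

Because the two alternatives in the first part can never match the same character, the engine does not need to backtrack between them, which keeps the scan straightforward even on longer files.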

Upvotes: 1
