PaulBarr
PaulBarr

Reputation: 949

Using regex to remove substrings from list items in python

Im sure this must be a duplicate question but I can't find an answer anywhere. I have a list with multiple strings as below:

['>ctg7180000016561_3757\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561_3824\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561_4513\nT\n']

And all I want to do is remove the numbers after the underscore, so in this example the output would be:

['>ctg7180000016561\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561\nT\n']

I am using regex and I have a perfect match but I cant work out how to actually remove the substrings. My code so far is:

pattern = re.compile('_[0-9]*')
for x in SequenceList:
    re.sub(pattern, '', x)

I'm aware that this is just changing the variable x, but even when I just print x within the for loop the pattern isn't removed. How do I actually remove the pattern and alter the list?

Thank you and sorry if this is already answered somewhere!

Upvotes: 3

Views: 3283

Answers (1)

thefourtheye
thefourtheye

Reputation: 239453

Strings are immutable. So, re.sub will create a new string. Instead, you can use list comprehension to create a new list with the replaced strings like this

import re
pattern = re.compile(r"_\d+")
print [pattern.sub("", item) for item in data]

Upvotes: 6

Related Questions