Reputation: 949
Im sure this must be a duplicate question but I can't find an answer anywhere. I have a list with multiple strings as below:
['>ctg7180000016561_3757\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561_3824\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561_4513\nT\n']
And all I want to do is remove the numbers after the underscore, so in this example the output would be:
['>ctg7180000016561\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561\nT\n']
I am using regex and I have a perfect match but I cant work out how to actually remove the substrings. My code so far is:
pattern = re.compile('_[0-9]*')
for x in SequenceList:
re.sub(pattern, '', x)
I'm aware that this is just changing the variable x, but even when I just print x
within the for loop the pattern isn't removed. How do I actually remove the pattern and alter the list?
Thank you and sorry if this is already answered somewhere!
Upvotes: 3
Views: 3283
Reputation: 239453
Strings are immutable. So, re.sub
will create a new string. Instead, you can use list comprehension to create a new list with the replaced strings like this
import re
pattern = re.compile(r"_\d+")
print [pattern.sub("", item) for item in data]
Upvotes: 6