Matthew S

Reputation: 3

How to join two consecutive lines of a file if they meet a certain condition?

** New to Python, sorry **

I'm trying to take a given example file and add only the lines containing "A" or "T" or "G" or "C" (DNA strands) to a list, using a function.

Example file:

gene1
ATGATGATGGCG
gene2
GGCATATC
CGGATACC
gene3
TAGCTAGCCCGC

Under gene2 there are two separate lines I need to concatenate using my function.

Here's what I have completed for my function:

def create(filename):
    """
    Purpose: Creates and returns a data structure (list) to store data.
    :param filename: The given file
    Post-conditions: (none)
    :return: List of data.
    """
    new_list = []
    f = open(filename, 'r')
    for i in f:
        if not('A' or 'T' or 'G' or 'C') in i:
            new_list = new_list  #Added this so nothing happens but loop cont.
        else:
            new_list.append(i.strip())
    f.close()
    return new_list

I need to somehow find parts of the file where there are two consecutive lines of DNA ("GTCA") and join them before adding them to my list.

If done correctly the output when printed should read:

['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']

Thanks in advance!

Upvotes: 0

Views: 543

Answers (3)

Chi

Reputation: 322

Regexes to the rescue!

import re

def create(filename):
    dna_regex = re.compile(r'[ATGC]+')
    with open(filename, 'r') as f:
        return dna_regex.findall(f.read().replace('\n', ''))

new_list = []
new_list += create("gene_file.txt")

Note that this implementation can produce false positives if the gene header lines contain an uppercase A, T, G, or C: once the newlines are removed, those letters are matched too and can fuse with the neighbouring sequence.

The function reads the whole file, removes the newlines, and then returns every run of characters consisting only of A, T, G, or C.
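
A small sketch of that caveat, assuming header lines that start with an uppercase letter (the `Gene1`-style headers and the names `DNA_BLOCK` and `extract_dna_blocks` are made up for illustration). It compares the flatten-then-match approach with one possible alternative: a multiline pattern that only accepts whole lines of A, T, G, C and joins directly adjacent ones.

```python
import re

# One or more whole lines made up solely of A, T, G, C,
# separated by single newlines (hypothetical helper pattern).
DNA_BLOCK = re.compile(r'^[ATGC]+(?:\n[ATGC]+)*$', re.MULTILINE)

def extract_dna_blocks(text):
    """Return each run of adjacent DNA-only lines as one joined string."""
    return [block.replace('\n', '') for block in DNA_BLOCK.findall(text)]

# Assumed sample: same data as the question, but with capitalised headers.
sample = "Gene1\nATGATGATGGCG\nGene2\nGGCATATC\nCGGATACC\nGene3\nTAGCTAGCCCGC\n"

# Flattening first lets the 'G' of each header leak into the matches.
flattened = re.findall(r'[ATGC]+', sample.replace('\n', ''))
print(flattened)
# ['G', 'ATGATGATGGCGG', 'GGCATATCCGGATACCG', 'TAGCTAGCCCGC']

# Matching whole lines keeps the headers out.
print(extract_dna_blocks(sample))
# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
```

A header that itself consists only of A, T, G, and C would still be read as DNA, so this narrows the false-positive case rather than removing it.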

Upvotes: 1

Elisha

Reputation: 23770

If we can assume that each DNA section is preceded by exactly one header line, we can use itertools.takewhile to group the DNA lines (the first header is skipped with lines[1:]):

from itertools import takewhile

DNA_CHARS = ('A', 'T', 'G', 'C')
lines = ['gene1', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC', 'gene3', 'TAGCTAGCCCGC']

input_lines = iter(lines[1:])
dna_lines = []

while True:
    dna_line = ''.join(takewhile(lambda l: any(dna_char in l for dna_char in DNA_CHARS),
                                  input_lines))
    if not dna_line:
        break
    dna_lines.append(dna_line)
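
The loop above works because takewhile also consumes the first line that fails the test, so each call silently swallows exactly one header. If that one-header assumption might not hold (two header lines in a row, blank lines, and so on), one alternative is itertools.groupby. This is only a sketch: `is_dna` and `join_dna_runs` are hypothetical names, and the test used here is stricter than the one above, since it requires every character of the line to be A, T, G, or C.

```python
from itertools import groupby

DNA_CHARS = set('ATGC')

def is_dna(line):
    """True for a non-empty line made up only of A, T, G, C."""
    return bool(line) and set(line) <= DNA_CHARS

def join_dna_runs(lines):
    """Join each run of consecutive DNA lines; every other line ends a run."""
    stripped = (line.strip() for line in lines)
    return [''.join(run) for dna, run in groupby(stripped, key=is_dna) if dna]

# Assumed input with an extra header line and a blank line thrown in.
lines = ['gene1', 'note', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC',
         '', 'gene3', 'TAGCTAGCCCGC']
print(join_dna_runs(lines))
# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
```

Since an open file iterates line by line, `join_dna_runs(f)` should work on a file object as well as on a list.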

Upvotes: 0

SpghttCd

Reputation: 10860

You can use sets to check if a line is a DNA line, i.e. consists of the letters ACGT only:

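with open(filename) as f:
    new_list = []
    concat = False  # True while the previous line was a DNA line
    for line in f:
        seq = line.strip()
        # Subset test: a DNA line is non-empty and uses no letters other than ACGT
        if seq and set(seq) <= {'A', 'C', 'G', 'T'}:
            if concat:
                new_list[-1] += seq  # continue the current strand
            else:
                new_list.append(seq)  # start a new strand
            concat = True
        else:
            concat = False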

# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
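
For reference, a quick sketch of how the different line tests behave. The helper name `is_dna_line` is hypothetical. It also shows why the condition in the question, `not('A' or 'T' or 'G' or 'C') in i`, only ever looks for 'A': `or` returns its first truthy operand, and `not` binds more loosely than `in`, so the whole test reduces to `not ('A' in i)`.

```python
DNA_CHARS = set('ACGT')

def is_dna_line(line):
    """True if the stripped line is non-empty and uses only A, C, G, T."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) <= DNA_CHARS

# `or` yields the first truthy operand, so only 'A' is ever tested.
print('A' or 'T' or 'G' or 'C')       # A

# Set equality demands that all four letters occur in the line.
print(set('ATAT') == DNA_CHARS)       # False

# The subset test accepts any mix of the four letters.
print(is_dna_line('ATAT\n'))          # True
print(is_dna_line('gene1\n'))         # False
print(is_dna_line('\n'))              # False
```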

Upvotes: 1
