Thunderpurtz

Reputation: 199

How would I go about parsing a text file of thousands of DNA bases?

Here's my situation: I have a massive text file of DNA bases (A, T, C, G). I would like to take every 60 characters (an arbitrary number) and put them on a new line, so that the bases are separated into chunks. I would also like each chunk to overlap the previous one by a certain number of bases. For example, given the 10-letter sequence ATGGCTGCTA, a chunk size of 4, and an overlap of 2, the first chunk would be ATGG, the next GGCT, then CTGC, and so on. I know I'll probably have to look into opening, reading, and writing text files with Python. If anyone has resources they could point me towards, or any tips and instructions for achieving this, that would be great.

Example of the formatting of the text I would be working with:

https://www.ncbi.nlm.nih.gov/nuccore/NC_000017.11?report=fasta&from=7661779&to=7687550

Upvotes: 0

Views: 59

Answers (1)

nosklo

Reputation: 222862

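data = 'GAGACAGAGTCTCACTCTGTTGCACAGGCTGGAGTGCAGTGGCACAATCTCTGCTCACTGCAACCTCCTC'
chunk_size = 5  # number of bases per output line
overlap = 2     # bases shared between consecutive chunks

# Advance by chunk_size - overlap so each chunk starts `overlap` bases
# before the previous one ends; the final chunks may be shorter than chunk_size.
for pos in range(0, len(data), chunk_size - overlap):
    print(data[pos:pos+chunk_size])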

The results:

GAGAC
ACAGA
GAGTC
TCTCA
CACTC
TCTGT
...
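
To apply the same slicing to a file like the FASTA download linked in the question, something along these lines might work. This is only a sketch: the function names (`read_fasta_sequence`, `overlapping_chunks`, `write_chunks`) and the default `chunk_size` and `overlap` values are made up for illustration, and it assumes the file holds a single sequence whose header lines start with `>`. Unlike the loop above, it stops once a chunk reaches the end of the sequence, so it does not emit extra short fragments at the tail.

```python
def read_fasta_sequence(path):
    """Return all bases in a FASTA file as one string, skipping '>' header lines."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


def overlapping_chunks(sequence, chunk_size, overlap):
    """Yield slices of chunk_size; consecutive slices share `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    for pos in range(0, len(sequence), step):
        yield sequence[pos:pos + chunk_size]
        if pos + chunk_size >= len(sequence):
            break  # this chunk already reached the end of the sequence


def write_chunks(in_path, out_path, chunk_size=60, overlap=10):
    """Read a FASTA file and write one overlapping chunk per line."""
    sequence = read_fasta_sequence(in_path)
    with open(out_path, "w") as out:
        for chunk in overlapping_chunks(sequence, chunk_size, overlap):
            out.write(chunk + "\n")
```

With the example from the question, `list(overlapping_chunks("ATGGCTGCTA", 4, 2))` gives `['ATGG', 'GGCT', 'CTGC', 'GCTA']`. The whole sequence is read into memory here, which should be fine for thousands of bases; a much larger file would call for a streaming approach instead.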

Upvotes: 1
