Reputation: 85
My aim is to discover a piece of text hidden as 8-bit ASCII in a very long (>115,000 characters) DNA sequence.
I've written code that opens the file containing the DNA and converts all C's and A's to 0 and all T's and G's to 1. I then convert the resulting bit string into ASCII characters. My code is below.
with open("DNAseq.txt") as mydnaseq:
    sequence = mydnaseq.read().replace('\n','')
DNAa = sequence.replace('A','0').replace('C','0').replace('G','1').replace('T','1')
DNAb = ''.join(DNAa)
DNAc = [DNAb[i:i+8] for i in range(0, len(DNAb), 8)]
DNAd = []
for i in DNAc:
    j = int(i,2)
    DNAd.append(j)
DNA1 = []
for i in DNAd:
    if i >= 32 and i <=127:
        DNA1.append(i)
text = []
for i in DNAd:
    j = chr(i)
    text.append(j)
Answer = open("textanswer.txt", 'w')
Answer.writelines(text)
Answer.close()
However, I am getting an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\x9e' in position 0: character maps to <undefined>
I have no clue what is causing this. My DNA sequence apparently contains a snippet of a play or poem hidden among random characters.
I've tested my code with testDNA.txt containing the following:
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
This returns (as expected):
Steak Bake
Can anyone shed any light on why I'm getting this error with my DNA sequence?
Upvotes: 3
Views: 5007
Reputation: 55469
As I mentioned in the comments, DNAd contains numbers outside the valid ASCII range. But you already filtered those out when you created DNA1, so you should be looping over DNA1 to build text.
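Here's a minimal sketch of why looping over DNAd fails. It assumes the output file is being encoded with a Windows code page such as cp1252; that's a guess based on the 'charmap' wording in the error message, and the default encoding on your machine may differ. The values list is made up for illustration.

```python
# Hypothetical values: 0x9e is outside printable ASCII, the other two spell "Hi".
values = [0x9e, 72, 105]

# chr() happily produces a character for any of these numbers...
chars = [chr(v) for v in values]

# ...but U+009E has no mapping in cp1252 (assumed here), so encoding fails.
# The same thing happens implicitly when the text is written to the file.
try:
    ''.join(chars).encode('cp1252')
    failed = False
except UnicodeEncodeError:
    failed = True

# Keeping only printable ASCII codes, much like DNA1 does, avoids the problem.
printable = [v for v in values if 32 <= v < 127]
text = ''.join(chr(v) for v in printable)
print(failed, text)
```

Note that this filter stops at 126, whereas the test in the question also lets 127 (the DEL control character) through.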
However, in Python 3 there's no need to call the chr function on each ASCII code number. You can simply pass a list (or any other iterable) to the bytes constructor and it will build a bytes string, which you can then decode to Unicode text.
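As a tiny illustration of that idea, using the first five ASCII codes from the test sequence in the question:

```python
# ASCII codes for the first five characters of the test message.
codes = [83, 116, 101, 97, 107]

# bytes() accepts any iterable of integers in range(256).
raw = bytes(codes)

# Decoding turns the bytes object into an ordinary str.
word = raw.decode('ascii')
print(word)
```

Decoding with 'ascii' raises UnicodeDecodeError if any byte is 128 or above, so out-of-range values still need to be filtered out first.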
Also, rather than using the str.replace method to convert the DNA letters to '0' and '1' chars we can use str.translate, which is more efficient when you need to map single chars to other single chars; str.translate can also delete unwanted characters. In the code below I use it to delete spaces and newlines. I also delete the Unicode Byte Order Mark, which your 'DNAseq.txt' file starts with.
Firstly, here's a demo using the short DNA sequence given in the question.
# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')
def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))
dna = '''\
ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCT
TAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG
'''
print(dna_to_bytes(dna).decode('ascii'))
output
Steak Bake
To find the message hidden in your DNAseq.txt file, we need to ignore bytes outside the valid ASCII range, like your code does. However, we also need to skip a couple of bits before we start converting blocks of 8 bits to bytes. There are only 8 possible offsets, and since the amount of data isn't huge it was easy enough to discover the correct offset of 2 by trial and error. OTOH, it did take me a little while to think of trying an offset. ;) If we were working with many millions of bytes then we'd probably need to resort to doing statistical analysis to find blocks of bytes that could be valid English.
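I found the offset by eye, but the trial and error could be automated along these lines. This is only a rough sketch, not what I actually ran: the scoring rule (longest run of text-like bytes), the TEXTY character set, and the helper names longest_text_run, best_offset and text_to_dna are all assumptions made up for illustration.

```python
import string

# Same translation table as the answer's code: A/C -> 0, G/T -> 1,
# deleting newlines, spaces and the byte order mark.
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')

# Bytes that look like English text. This particular set is an arbitrary choice.
TEXTY = frozenset((string.ascii_letters + " \n.,;:'!?-").encode('ascii'))

def dna_to_bytes(dna, offset=0):
    bits = dna.translate(tbl)
    # Unlike the answer's version, stop early so that a trailing group of
    # fewer than 8 bits is dropped rather than converted.
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits) - 7, 8))

def longest_text_run(data):
    """Length of the longest run of consecutive text-like bytes."""
    best = run = 0
    for byte in data:
        run = run + 1 if byte in TEXTY else 0
        best = max(best, run)
    return best

def best_offset(dna):
    """Return the bit offset (0-7) whose decoding has the longest text run."""
    return max(range(8), key=lambda k: longest_text_run(dna_to_bytes(dna, k)))

def text_to_dna(text, junk='GC'):
    """Hypothetical helper for the demo: encode text as A/T letters
    (A for 0, T for 1), preceded by `junk` letters that shift the alignment."""
    bits = ''.join(format(b, '08b') for b in text.encode('ascii'))
    return junk + bits.translate(str.maketrans('01', 'AT'))

# Two junk letters in front mean the text only lines up at offset 2.
demo = text_to_dna('Men of great worth resorted to this forest,')
print(best_offset(demo))
```

On real data you would call best_offset on the contents of DNAseq.txt. A longest-run score is a crude stand-in for proper statistical analysis, but it should be enough when the hidden text is one contiguous block.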
The following program doesn't bother trying to isolate the hidden message, since it's easy enough to spot in the middle of the garbage text. Note that the 1st line of the message is hidden at the end of the preceding long line of garbage.
# ASCII codes, excluding control chars apart from newline
asciibytes = frozenset(b'\n' + bytes(range(32, 127)))
# Translation table to convert DNA letters to bit characters
# Deletes newlines, spaces, and the Unicode Byte Order Mark
tbl = str.maketrans('ACGT', '0011', '\n \ufeff')
def dna_to_bytes(dna, offset=0):
    # Convert DNA letters to zero and one characters
    bits = dna.translate(tbl)
    # Convert groups of 8 zeros and ones to bytes, starting from `offset`
    return bytes(int(bits[i:i+8], 2) for i in range(offset, len(bits), 8))
fname = 'DNAseq.txt'
with open(fname) as f:
    dna = f.read()
b = dna_to_bytes(dna, offset=2)
a = bytes(u for u in b if u in asciibytes)
print(a.decode('ascii'))
output
;J\Zza%_&jHs F0kM:!ZsfCq1)^7!Bg%=8:2eMz(|tl KRS@@9$`!2wAD5@>K~_CA"u_R9<
p?+D*WRCH`=LY/v0&Sl[l|"x1h-_GT!P'36'PS&&<eY5yakZd?$R!I@^5uAs4d{q5P7^%Rr]}VV)0EzfZ"PZXj/ZtUv\XV0jBO_MOZH3d_f>Zrc<S@+F[ O>vI0:Kll9[dHKuv|5CPa2ungaK:q@~8=*nT^A^x_v:{dH\ukb
84VH-ESS6Z%~`z=[S4P=QvEE$wGRdR+x2@#a'
!&:!Ei:ttE;C9MWp:sF
)91J"7c@,2@{0$c,6R0=p.RJawE*U+}}Vo^2Dhf-PAn@O1yPIH~4J9e6H %,3>)@:K(N_o4\`'`;yQ$
?5t'^@W*YlaEI(@CT*H^u.1 czQ*
H`SzD)4W"[\5JEnI0E`N 3[gAP`Ve_mBE\\v!932E&V4sw~*RurKPq2;B*BwF6c-'fJ~<=25=EAea\Qu!:NW:@d'"ZB?q 0D9FrbGm*PLR*^QwCg>,a,U'_-&P!#;h.f3E!jt]
BOGnmt0*#
g'zkeF;g"kBU(/`I1dxO`+0Q=6bqxI_Y\k#?'r'2nfJ"R$<eaw,(<LIUQxMPqsb}Us/ga?/UY3N#<DWh*$ry#BhtOL'+&c.CZ]BpRM1]bEVfhw2aaNGyR4r,V[Bx=`fd+%@eiH-bXv2lYM8gj958PK"XSWT?w_`E;.-`yxxXmIt+THhC4CVT%9-+T;BX0H
9wTnr (\KibvKI:OZUQ <x*"`_9.nc" W"x>A0?4D%=fHpa cvai;+a3\6*@2<@u!x|R0QQJ8|\`jrFPJH!$v=?bXe54[9oTBno
*ly[1EbHPh/Lh8c9*YQ0BR9NI,-q$IR~]$g#%'[,y.8He%e@Pg 9\v(:31wt9>VcP<Dl37`|yIU>nI"ZJ5Q4_}gNzK$.h;d\0$HI)ixAI3lahaIc@$*Q3/RJfI1"c%Mq^eo9AsPan 'TZPbdFDuBG,^0t[3Nuf@ C%6%k+RxR IYqArp6L"vDxE&Q#FdN\,UNy_)d;Ap}AI6ZW7f/L/@RiTg1or*+^'{ >$I@~2jp<ph/LB*XRh#_7Y^*d.fJ[#Odx."v&IYU%:HB4;(iMh[H jAYci5I){_}1A64{/'CRsYWdkP[!h$s"-KmsM+eLa$||N\#H"NYS.[_#+r4?m7*AredM!_%/;tFP#M4hh?kA)Z%zJ3-x]KK.FcAYOHO+dzLD'w|:,>?qG4mU&T+ABFXV@Wa&ER;0zEj.Qi?<tff(*Y)M~rRgWxd^dnlm{ATYy;^a'
[elI[nu/}42#kI$+3w"8pehY7`A<NV5V(J\?z=R-(;*d&\-c?OJ,zcs?`l6QZ5`U2U%m"F&!0 WBOVqeY5*^@j'j(S.a3{1C9&'W,
vo*a!U1]UQcib>%QlI]|B$U/zzQd)_$b f [d_";JgQ P**IFXQ& %* Xa88%T
?er*hM|dq@]5s_5H"#IeTeQ5BR 'vq[E\e&A1ykv4a$~`*hW4tJ.cIwb('rG]y){xxH|Jdc@~-.[{1kAJ VWzVGd&c?<-%Jt>e55eh^LX<%G f,Byg'<#[@.+a (oW*KrSRM`S18#1V\!jC^SW,v1Sc-?s~pcrsaBX``dg1JmzWO^7iw8AAK$^1&7F[W*cSVCuq5iqYayWUpfQG~^B88!gRR!O
-n"Gq
Rzfn.`w\.3)aNw2\^)ELn%KKDoiF)$b?$>H$7?/eNR=DglRLi49Do\ Tx%@5KK>(jU(D;)iQjC0>T:;J[sxCc`|y+5BnxQ.h8#/@%*1zAVHvFug"Aqe7wG^!D!10-N^Mp) #N'kto)tyXl0W4u[!Hb&dpqFu7P#:Ui\kzVD~ AgV]*Q%X&i#'2yr_TvaGU4PpOVT*x!W4b(py4acV3XId^lIR%b=-
:~EuBmT&$P|W0Ae.lZ"%NlGf/M R)eY,iaJo"
^RT9IBG<xH!I_B EC2@0Oy*";>JA+jyTBx;#Qq5"G7)D0HPEFI6D/#:Nc-DrSVJEeJ$.}M`8Ic9"dda%(2#"~;C)SAqbHYQ"D#O;qWz}>j#u9X1BD
8lNowODQt\v+K+:ELLoW2w9iz!6uY%*71PNX857Dz(vwtLb<Tj`~243q
Gr1urC46'EcVd%/#z6!Fr9omhk{|!,].YM T<j^m0:"9?r{O/9|.4zZ@Pb#E#)[jY\s|I/<m=GJ'<X..nr*Y4v1<RHe>1{`FoBQFhE"d5(eXW,`#OzeC{AKh?[aL+lz+Hw:&2c^sA!$:e)b
4I6DnkgW^1 +*F^.O_oB]]b&^(bW))Ma HQ1P:tE,[,?_xTnq6c?p0er!GRV=u
o8kcT=aJO+$zqN78,yZT@xiBr!G)URJ_gI:($e J3H._5i# pDy(u*-oI3U|/Iq"szA(d3-2S >!uT{C{{zp86lZ02@K!?qGQIO{dOi%:^+av M
]~$H0GJwl@<oQRCr.
9bYcB>dU:P8A^ 0S4zl!GA/AcYYUw({_5IAUx-&ISqbLKM3\VV
,tTc~cVlqCxc{6?v9wN6"rZ+
(E%r
%I{G2JVp6_:OG4T&7, /y_$w_^XG+:|0v/;0oHxeaBao*<1>ChA4W0j|v^Il5skOFD2vT.>9`N3M'S<fgI,-_h,;oEINwu<~;{nK(rQ9cNLC=jXFMq88PxPFy:K^hD~*#tvsDCM :|~@p\JB=)2#i$2*Jd2{!2|h?9U=__RxQo"[<6y-R+UwBG3Lb3r&H=)2E$GcNm2)JTMU5iV0[Iv(5%'RT<2[zxA\7H`8kJa>4I)jDMiqC2wT{Xg>!*.8Yf7^{|t@P/KEY4intvq"OR=ch5}k4uqncK
9[;0/A/9;5%t+&|wT
/=FY_$q("/+,cqa
X\DE?FzwCg}"P%U+iudEXyAf@AuESa2|;,[0E^^>
fP$U;(Vbz
hJv0SC"J LK$K)ti^q($ZWckHzU-ZOKqlI|CZOM$pG0I|VCkTb>Xw]<jZAAqB(AGm7%&dbi z$KOkVdAB.
+gy4/w;ZFV|)zY|`U'g8EV7W*4<*dS*%Yl"D,@P#N^Jd:Xwc"
[H_gjl$jAI3{i0wE~2o(n #GVI8
`d$Y,0Gs?7h0`vYmLN)&SG;!(
@:,N6:Ez?8^T7+oawF4KY|oudzBZ!@ke8~p3|d$\U)P^D+f8L;>SxH.tPw
/"CtOmy?m)L*E:[^>A2u\*eW4yGvvAy(.)H=auJ?i_$PLaYb",*W/H3u=:4_"9%J"dF_+{`B=bq~hTm# qiz)iq\"LJ]oll7_2b!*]}5}{^O1o@)UE%dA6ea~O!~ (S7(q>2xu}i8Vf9N)}^n]e} >($6_/K,Kmiv)'`2*~z-S3zg^@$eTTn^Y1*jH_N"5M~EtQ4]V&N'1:HP4/e`Y|h.^xLPM:[F`s!E9]m*J'3Zni24}UNQ&'Xg4`P.tS#Lku86o PJTM+:(J&k;]a2<6E=bAgN?_q6*j3_hTRAk7%zH$M)e(#("oIAkH{LH,+"x1RZ hkxF<.9#.r^R<AA%FUS}"ODLL*;r)VS!$3(N1[y^ZXV6cLL`kBIW]Dd,(&DEi}8f/40pTEDLr7KtNV!piBIgoH].|c#$6~]Ex$-9P`H Ob%;H|7,kS1>[]6TBR}D1;
x %Y#w.Hh8NzOL,[zOugJ60"R#m@`E YKo>YPc&C]O
O1z7O;R8~
DYw`6kBxdha_l..%]G4Z/j:Ic1BHe$5W^0.;Hqxq'D 1 RLa1CKR)LVA[lk2,z@D"jl%~N-w)y)=Gc?(y>pE9|QA[?
4,2@$)8kMJ^XmNeBuuN5Y)4ZdV"#6?x7^$)C|a[77H;i5)3xq.Af=n7#8j.>'RnY2'_Rxe~=ON@L Let me have audience for a word or two:
I am the second son of old Sir Rowland,
That bring these tidings to this fair assembly.
Duke Frederick, hearing how that every day
Men of great worth resorted to this forest,
Address'd a mighty power; which were on foot,
In his own conduct, purposely to take
His brother here and put him to the sword:
And to the skirts of this wild wood he came;
Where meeting with an old religious man,
After some question with him, was converted
Both from his enterprise and from the world,
His crown bequeathing to his banish'd brother,
And all their lands restored to them again
That were with him exiled. This to be true,
I do engage my life.
[b$gdj~S~ma 7&x$aDa2w/N@&}Dx'+- p;^9J]9?!"HKTY&X
!dF5 ab%|=(Z--!<*)T$I<L!$fT`."ZhD~2FP?8M-4{u@1_qJ
nN+m:FvEI>bA
(VVJyAc2U|ixggPwTEXBsW',S>z3=u[C|J)Zbv^&4A;QAE(9%O\ #.z8T=+
L.!ycBr/WBTAWTT Jf|fEt|@&8^E/8DnV~:7S#i<BsV lh/S];@qH{BH.MD`YH~dr((rI#B%\ID
JqPcnffc<-PI+|:7QBy,l5.G'/sU!"B[Mx[VgQo8.J9fz"LlcMSc\OWU^L7]$ u_#Dy85UdPd1 %3yEPRpziAKOu>/9+?@k!v(mRcu}5m2#5_13FUPO^uUhe{$L9.W~1_{([~=DJfU)J/5F>0=eQr0&A\__C
T0A
\Y]a!-:](p]gp_^u\@Iu% 7j@3OaIT5baAuFv,2}+PjcK]Xm9Dfx9"I|JC>=!GwFHY>@`
`%}B.TT2aq#Q"iB R9VYH!R;5wzE2;[email protected](% IjO&(lG(vPzX SD1$T\SP+Tm4y)k?CQK8VH3`Q%{zd2^iBET}QB1(~YK0|UQ.a5FuHAxc<+XG\w'6 RrJv.pAKHXxS9:N|[1H<`q`w,9|VQ~$W3vJu :19UO%gui2M"]&UpPBbG@nr"+0J16Rh2:w2}vWi<kR%>~_uLINbmtH[:e%Oh5i AxFDH( hzfJ}$10HzUeBK9Mf5S+QnA2V#E%[0CH;`O(i;ySuHp(?B3H]boY'm,DU$NJ\L4#o>bl|S"%'ovsdP]97.SR-x34uH.{};y<%IYa_Nor2~0+\A<^&c5)2 }QlyNr#2lY$?yx}^N!,Q\G'2z
jx`<M!""P3_6mzFL5')0b=dSfX$D:xSh'AxU$Lr*ff?""/Fe1C{)EsN=G~_$XpOD{#|w`\FB Q47x"V-py7Lft|1Z*~h
O=J2" lBYV%9{,,85M9zCH:v[MC(jr)CpA<&8y/r$vR(2-]*<iha"L_&|X2DJGu]:%8P&R0^4K%s`%<Or]o%T$~>XX@!3)98c$&s3MXQB^+{p<:hB}/CIk\-.}ES=_-=y^~A5<Xe(:2f4FfB)('%4?#N5M,
B@DJ0.('.N$~Haf|)`GxiZ40Xd 4I0C$+tA!i>18;. %~`G!_&%,#v;K$8/$x15urOnMdnRY!+` "l;>itE=B]>Q}'_2[W&}49dg/&SRM(]`CR|X>>i*?':}OLrcT-4um\"b%awP V%?{RV$QTP0]4C[WOeG*%&|_"b-@?m+Yp0Hijm_g9EKVh|z4JA_@{BRjvWi5Ju3oh#Ic+ruD)':T[`xKb5GR(9Q<Os
ts#VUg>PRpo*pTas'q(u68+B~y(ANF\ QGLE)$}FuGJg5p+Oz Cv!<dQJ> 4BsiR~8F:}t;Dy%yYIGq9c~QF?R.2_!,Z
Bg
'PV1CZ]Pk];[Y8Y-fCDvLnxBmE+I)J,)zgX(:{UmU}yPeU$!}Ld:ac*F8buf6Ane
FWIW, the secret message is a passage from Shakespeare's As You Like It, Act 5, Scene 4.
Upvotes: 4
Reputation: 3245
I think you want to be using the builtin chr() function.
Here's a brief example that uses str.translate to convert the DNA letters to '0' and '1' characters, then converts each 8-character substring into its ASCII equivalent.
>>> s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"
>>> trans_dict = {"A":"0", "C":"0", "G":"1", "T":"1"}
>>> trans_table = str.maketrans(trans_dict)
>>> s.translate(trans_table)
'01010011011101000110010101100001011010110010000001000010011000010110101101100101'
>>> t = s.translate(trans_table)
>>> [t[i:i+8] for i in range(0, len(t), 8)]
['01010011', '01110100', '01100101', '01100001', '01101011', '00100000', '01000010', '01100001', '01101011', '01100101']
>>> [chr(int(t[i:i+8],2)) for i in range(0, len(t), 8)]
['S', 't', 'e', 'a', 'k', ' ', 'B', 'a', 'k', 'e']
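To get a single string rather than a list of characters, the same steps can be combined with str.join. This is just the session above rolled into a short script; the variable name message is my own.

```python
s = "ATAGCCTTAGTGCTACATTAATCGCGGACAAGAGGAGCTTAAGCCCACCGACCCGAAGGAAACTCGGAGATTCGGAAGCG"

# Map each DNA letter to a bit character.
trans_table = str.maketrans({"A": "0", "C": "0", "G": "1", "T": "1"})
t = s.translate(trans_table)

# Convert each group of 8 bits to a character and join the results.
message = ''.join(chr(int(t[i:i+8], 2)) for i in range(0, len(t), 8))
print(message)
```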
Upvotes: 3