Reputation:
I'm a Python newbie. I've been searching for days, but have only found fragments of what I need. I'm using Python 2.7 on Windows (I chose Python because it's multiplatform and the result can be portable on Windows).
I'd like to write a script that searches a folder for *.txt UTF-8 text files, loads their content one file after another, converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variations of the tags: one for the head of the file and one for the tail, and the head and tail are separated by an empty line. After that, the result has to be written out to other text files, such as *.htm. To illustrate:
unicode1.txt:
űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4
The result in unicode1.htm has to be:
<p class='aaa'>űnícődé text line1</p>
<p class='aaa'>űnícődé text line2</p>
[empty line]
<p class='bbb'>űnícődé text line3</p>
<p class='bbb'>űnícődé text line4</p>
I started developing the core of my solution, but I got stuck. See the script versions below (for simplicity I chose encoding with xmlcharrefreplace).
V1:
import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
    line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1)
    text=text+re.sub(r"$", "</p>", line, 1)
print text
It worked and gave a good result, but I don't think fileinput is a suitable approach for this task.
V2:
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1)
    text=text+re.sub(r"$", "</p>", line, 1)
f.close()
print text
It messed up the result: the closing tag appears at the start of the line, replacing the first letter, and so on.
V3 (tried multiline flag):
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    line=re.sub(r"^", "<p>", line, 1, flags=re.M)
    text=text+re.sub(r"$", "</p>", line, 1, flags=re.M)
f.close()
print text
Same result.
V4 (tried 1 regex instead of 2):
import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
    line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
    text=text+re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
f.close()
print text
Same result. Please help.
Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?
Edit2: changes for a more logical approach
text+=re.sub(r"^(.*)$", r"<p>\1</p>", line, 1)
Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down, and it seems to be caused by the concatenation with +:
# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
    text+=line
print text
This results in:
unicode text line1\r\r\n unicode text line2
Any idea how to fix this?
Upvotes: 4
Views: 1016
Reputation: 16327
There's no need for regular expressions at all here, just do this:
with open('utf8.txt') as f:
    class_name = 'aaa'
    for line in f:
        if line == '\n':
            class_name = 'bbb'  # empty line: switch to the tail class
        else:
            # decode / convert line
            line = '<p class="{0}">{1}</p>\n'.format(class_name, line.rstrip())
            # write line to file
The results you are getting do not appear to be caused by the regular expressions, which look correct. The problem is most likely in the line where you do your encoding / converting. Print that line without adding the tags to see whether it is as expected.
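One thing worth checking alongside this is the line ending. `codecs.open` always opens files in binary mode, so on a file with Windows CRLF endings each line still ends in `\r\n`, and `$` matches just before the final `\n`, that is, after the `\r`. The sketch below reproduces that on a hard-coded sample line (the sample text and variable names are illustrative, not taken from the question's file) and shows one possible fix: strip the line ending before adding the tags.

```python
import re

# One line as it would be read, untranslated, from a file with CRLF endings.
line = u"text line1\r\n"

# "$" matches just before the trailing "\n", so the tag lands after the "\r".
broken = re.sub(r"$", u"</p>", line, count=1)

# Stripping the line ending first and re-adding "\n" avoids the stray "\r".
fixed = u"<p>{0}</p>\n".format(line.rstrip(u"\r\n"))

print(repr(broken))
print(repr(fixed))
```

If `broken` is printed to a console, the `\r` would move the cursor back to the start of the line, which could make the closing tag appear to overwrite the first letters.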
Upvotes: 3
Reputation: 414179
#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa' # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line: # empty line
        klass = 'bbb' # start tail part
        print(line)
    else:
        print('<p class="%s">%s</p>' % (klass, line))
$ python txt2html.py c:\root\dir
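As an alternative to the in-place `fileinput` approach above, here is a minimal sketch that keeps the conversion in a plain function and writes a separate output file. It assumes `io.open` in text mode, which decodes UTF-8 and translates `\r\n` to `\n` on read; the names `to_html_lines` and `convert_file` and the sample data are made up for illustration. The import fallback is there because `cgi.escape` exists in Python 2.7, while newer Python 3 versions provide `html.escape` instead.

```python
import io

try:
    from html import escape   # Python 3
except ImportError:
    from cgi import escape    # Python 2.7


def to_html_lines(lines, head_class=u"aaa", tail_class=u"bbb"):
    """Wrap non-empty lines in <p> tags; switch class at the first empty line."""
    klass = head_class
    out = []
    for line in lines:
        text = line.strip()  # also removes any trailing "\r\n"
        if not text:
            klass = tail_class  # empty line: the tail part starts here
            out.append(u"")
            continue
        # Escape &, <, > and quotes, then turn non-ASCII characters into
        # numeric character references such as &#369;.
        ascii_bytes = escape(text, True).encode("ascii", "xmlcharrefreplace")
        out.append(u'<p class="{0}">{1}</p>'.format(klass, ascii_bytes.decode("ascii")))
    return out


def convert_file(src, dst):
    """Hypothetical helper: read a UTF-8 text file, write the tagged ASCII result."""
    with io.open(src, encoding="utf-8") as f:
        html_lines = to_html_lines(f)
    with io.open(dst, "w", encoding="ascii") as f:
        f.write(u"\n".join(html_lines) + u"\n")


# Sample input written with escapes so the block needs no source encoding line.
sample = [u"\u0171n\u00edc\u0151d\u00e9 line1\r\n", u"\r\n", u"line3 & more\r\n"]
for html_line in to_html_lines(sample):
    print(html_line)
```

Keeping the per-line logic in `to_html_lines` means it can be checked on an in-memory list before any files are touched; `convert_file` could then be called for each path produced by a directory walk such as `textfiles` above.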
Upvotes: 1