Python multiple regular expression replace

Question

I'm a Python newbie. I've been searching for days, but have found only bits and pieces of what I need. I'm using Python 2.7 on Windows (I chose Python because it is cross-platform and the result can also be run on Windows).

I'd like to write a script that searches a folder for *.txt files encoded in UTF-8, loads each file in turn, converts non-ASCII characters to HTML entities, and then adds HTML tags at the start and end of each line. There are two variants of the tags: one for the head of the file and one for the tail, where head and tail are separated by an empty line. The result then has to be written to another text file, such as *.htm. To illustrate:

unicode1.txt:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

The result should be in unicode1.htm:

űnícődé text line1
űnícődé text line2
[empty line]
űnícődé text line3
űnícődé text line4

I started to develop the core of my solution, but I got stuck. See the script versions below (for simplicity I chose to encode with xmlcharrefreplace).

V1:

import re, cgi, fileinput
file="_utf8.txt"
text=""
for line in fileinput.input(file, inplace=0):
  line=cgi.escape(line.decode('utf8'),1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "", line, 1)
  text=text+re.sub(r"$", "", line, 1)
print text

It worked and gave a good result, but I don't think fileinput is suitable for this task.

V2:

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "", line, 1)
  text=text+re.sub(r"$", "", line, 1)
f.close()
print text

It messed up the result: the closing tag appeared at the start of the line, overwriting the first letters, and so on.

V3 (tried multiline flag):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  line=re.sub(r"^", "", line, 1, flags=re.M)
  text=text+re.sub(r"$", "", line, 1, flags=re.M)
f.close()
print text

Same result.

V4 (tried 1 regex instead of 2):

import re, cgi, codecs
file="_utf8.txt"
text=""
f=codecs.open(file, encoding='utf-8')
for line in f:
  line=cgi.escape(line,1).encode('ascii', 'xmlcharrefreplace')
  text=text+re.sub(r"^(.*)$", r"\1", line, 1)
f.close()
print text

Same result. Please help.

Edit: I just checked the result file with a hex editor, and there is a 0x0D byte before each closing tag! Why?
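One plausible cause, sketched here in Python 3 syntax: `codecs.open` reads the underlying file in binary mode, so on Windows each line still ends in `\r\n`, and a `$` without the multiline flag matches just before the final `\n`, that is, after the `\r`. The `<p>` and `</p>` tags are placeholders for illustration, since the tags actually used are not shown above.

```python
import re

# A line as codecs.open would deliver it from a CRLF file: the "\r" is still there.
line = "text line1\r\n"

# "$" matches just before the trailing "\n", so the closing tag lands
# between "\r" and "\n". "<p>" and "</p>" are placeholder tags.
opened = re.sub(r"^", "<p>", line, count=1)
tagged = re.sub(r"$", "</p>", opened, count=1)
print(repr(tagged))

# Stripping the line ending first and re-adding a bare "\n" avoids the problem.
clean = "<p>" + line.rstrip("\r\n") + "</p>\n"
print(repr(clean))
```

When the first form is printed to a console, the `\r` moves the cursor back to the start of the line, which would explain a closing tag that seems to overwrite the first letters.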

Edit2: changes for a more logical approach

text+=re.sub(r"^(.*)$", r"\1", line, 1)

Edit3: with a hex editor I saw the reason for the messed-up result: an extra CR (0x0D) byte before each CRLF. I tracked the CR problem down to what I think causes it, the concatenation with +:

# -*- coding: utf-8 -*-
text=""
f=u"unicode text line1\r\n unicode text line2"
for line in f:
  text+=line
print text

This results in:

unicode text line1

 unicode text line2

Any idea how to fix this?
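The concatenation itself is probably not the culprit. A more likely cause is that output written in text mode on Windows turns every `\n` into `\r\n`, so a string that already contains `\r\n` comes out as `\r\r\n`. The sketch below (Python 3, using `io.TextIOWrapper` with `newline="\r\n"` to imitate that translation on any platform) shows the effect and one possible way around it. The helper name `written_bytes` is made up for this illustration.

```python
import io

def written_bytes(text):
    # Hypothetical helper: write text through a stream that translates
    # every "\n" to "\r\n" (as Windows text mode does) and return the bytes.
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
    stream.write(text)
    stream.flush()
    return raw.getvalue()

text = "unicode text line1\r\n unicode text line2"

# The "\n" of the existing "\r\n" is translated again, giving "\r\r\n".
doubled = written_bytes(text)
print(repr(doubled))

# Normalising to a bare "\n" before writing leaves a single "\r\n" on output.
fixed = written_bytes(text.replace("\r\n", "\n"))
print(repr(fixed))
```

Stripping the line ending from each line as it is read, and adding a bare `\n` back when writing, has the same effect as the `replace` call used here.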

jfs · Accepted Answer

#!/usr/bin/env python
import cgi
import fileinput
import os
import shutil
import sys

def textfiles(rootdir, extensions=('.txt',)):
    for dirpath, dirs, files in os.walk(rootdir):
        for f in files:
            if f.lower().endswith(extensions):
                yield os.path.join(dirpath, f)

def htmlfiles(files):
    for f in files:
        root, _ = os.path.splitext(f)
        newf = root + '.html'
        shutil.copy2(f, newf)
        yield newf

for line in fileinput.input(htmlfiles(textfiles(sys.argv[1])), inplace=True):
    if fileinput.isfirstline():
        klass = 'aaa'  # start head part
    line = cgi.escape(line.decode('utf-8').strip())
    line = line.encode('ascii', 'xmlcharrefreplace')
    if not line:  # empty line
        klass = 'bbb'  # start tail part
        print(line)
    else:
        # the <p class="..."> markup is an assumption; any tag taking a class fits here
        print('<p class="%s">%s</p>' % (klass, line))

Example

$ python txt2html.py c:\root\dir
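
For readers on Python 3, where `cgi.escape` is no longer available, the per-line logic of the script above can be sketched as a pure function. This is an illustrative rewrite rather than the answerer's code: the function name `lines_to_html` and the `<p class="...">` markup are assumptions, and `html.escape` stands in for `cgi.escape`.

```python
import html

def lines_to_html(lines, head_class="aaa", tail_class="bbb"):
    """Wrap each non-empty line in a tag; switch class after the first empty line."""
    klass = head_class
    out = []
    for line in lines:
        # Escape &, < and > (html.escape also escapes quotes by default),
        # then turn non-ASCII characters into numeric character references.
        escaped = html.escape(line.strip())
        escaped = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
        if not escaped:
            klass = tail_class  # an empty line starts the tail part
            out.append("")
        else:
            # The <p class="..."> markup is assumed for illustration.
            out.append('<p class="%s">%s</p>' % (klass, escaped))
    return out

sample = ["űnícődé text line1", "", "a < b"]
result = lines_to_html(sample)
for row in result:
    print(row)
```

Keeping the transformation separate from file handling makes it testable without touching the disk; the file walking and copying from the script above could then feed it lines read with `open(path, encoding="utf-8")`.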
