atricapilla

Reputation: 2630

How to iterate over files and replace text

I'm a Python beginner: how can I iterate over the CSV files in one directory and replace strings, e.g.

ww into vv
.. into --

So I do not want to replace whole lines that contain ww; I only want to replace the ww string itself within the line. I tried something like

#!/Python26/
# -*- coding: utf-8 -*-

import os, sys
for f in os.listdir(path):
    lines = f.readlines() 

But how to proceed?

Upvotes: 4

Views: 6759

Answers (3)

Randall Cook

Reputation: 6776

See the other answers for information on replacing strings. I want to add more information about iterating files, the first part of the question.

If you want to recurse through a directory and all its subdirectories, use os.walk(). os.listdir() does not recurse, nor does it include the directory name in the filenames it returns. Use os.path.join() to form a more complete pathname.
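As a rough sketch of that advice (the helper name iter_csv_paths and the .csv filter are my own assumptions, not something from the question), recursing with os.walk() and rebuilding full paths with os.path.join() could look like this:

```python
import os

def iter_csv_paths(root):
    """Yield the full path of every .csv file under root, subdirectories included.

    iter_csv_paths is a hypothetical helper name used for illustration.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith('.csv'):
                # os.walk() yields bare file names, so join them with dirpath
                yield os.path.join(dirpath, name)
```

Calling `for p in iter_csv_paths(path): ...` then gives pathnames that can be passed straight to open().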

Upvotes: 0

eumiro

Reputation: 212885

import os
import csv

# path is the directory that holds the CSV files
for filename in os.listdir(path):
    with open(os.path.join(path, filename), 'r') as f:
        for row in csv.reader(f):
            cells = [ cell.replace('ww', 'vv').replace('..', '--')
                      for cell in row ]
            # now you have a list of cells within one row
            # with all strings modified.
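The loop above only builds the modified cells in memory. To actually save them, one option is to write each row to a second file with csv.writer. This is only a sketch, assuming Python 3 (hence newline='') and the replacements from the question; the function name replace_in_csv and its parameters are made up for illustration:

```python
import csv

def replace_in_csv(path_in, path_out, replacements):
    """Copy path_in to path_out, applying each (old, new) pair to every cell.

    Hypothetical helper; replacements is a list such as
    [('ww', 'vv'), ('..', '--')].
    """
    with open(path_in, newline='') as src, \
         open(path_out, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            cells = []
            for cell in row:
                for old, new in replacements:
                    cell = cell.replace(old, new)
                cells.append(cell)
            writer.writerow(cells)
```

Writing to a separate output file keeps the original intact until you have checked the result. Note that csv.writer may change quoting or line endings compared with the input.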

Edit: Is this for learning/practicing Python, or do you just need to get the job done? In the latter case, use the sed program:

sed -i 's/ww/vv/g' yourPath/*.csv
sed -i 's/\.\./--/g' yourPath/*.csv

Upvotes: 9

eyquem

Reputation: 27575

Since you want to replace strings with strings of the same length, the replacements can be done in place, that is to say by rewriting only the bytes that must be replaced, without having to write out a whole new modified file.

With a regex, this is easy to do. Whether or not the file is a CSV file makes no difference to this method:

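from os import listdir
from os.path import join
import re
pat = re.compile(r'ww|\.\.')
dicrepl = {'ww':'vv' , '..':'--'}

# path is the directory that holds the files
for filename in listdir(path):
    with open(join(path,filename),'rb+') as f:
        ch = f.read()
        # reading moved the pointer to the end: go back to the start
        f.seek(0,0)
        pos = 0
        for match in pat.finditer(ch):
            # jump forward from the end of the previous match
            f.seek(match.start()-pos, 1)
            # overwrite the match with its same-length replacement
            f.write(dicrepl[match.group()])
            pos = match.end()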

Opening the file in binary mode is absolutely necessary for this kind of processing: that is the 'b' in the mode 'rb+'.

Opening the file in mode 'r+' allows reading AND writing at any desired position in it (had it been opened in 'a' mode, we could only write at the end of the file).

But if the files are so big that the ch object would consume too much memory, the code should be modified.
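One possible modification for big files, offered only as a sketch: map the file into memory with mmap instead of reading it into a string, and let the regex scan the map directly. This is a swapped-in technique, not the code above. It assumes Python 3 (bytes patterns) and a non-empty file; the function name replace_in_place_mmap is hypothetical:

```python
import mmap
import re

PAT = re.compile(br'ww|\.\.')
REPL = {b'ww': b'vv', b'..': b'--'}   # every replacement has the same length as its key

def replace_in_place_mmap(path):
    """Overwrite same-length matches in place through a memory map."""
    with open(path, 'r+b') as f:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(f.fileno(), 0) as mm:
            for match in PAT.finditer(mm):
                # Slice assignment works because the length does not change.
                mm[match.start():match.end()] = REPL[match.group()]
```

The operating system pages the file in as needed, so the whole content does not have to be held in a Python object at once.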

If the replacements are of a different length from the original strings, it is practically mandatory to write a new file with the modifications. (If the replacement strings are always shorter than the strings they replace, that is a special case: it is still possible to work without writing a new file, which may be worthwhile on a big file.)
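For the different-length case, a minimal sketch of the "write a new file" route could be the following. It assumes Python 3, that no match spans a line break, and that the temporary file can live in the same directory as the original; rewrite_file and its parameters are illustrative names only:

```python
import os
import tempfile

def rewrite_file(path, pattern, repl_table):
    """Write a modified copy of path line by line, then swap it in.

    pattern is a compiled bytes regex; repl_table maps each possible
    match to its replacement (lengths may differ).
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as dst, open(path, 'rb') as src:
            for line in src:
                dst.write(pattern.sub(lambda m: repl_table[m.group()], line))
        # Replace the original only after the copy is complete.
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

For example, `rewrite_file(p, re.compile(br'ww|\.\.'), {b'ww': b'vvv', b'..': b'-'})` would handle replacements that grow or shrink the text.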

The point of doing f.seek(match.start()-pos, 1) instead of f.seek(match.start(), 0) is that it moves the pointer directly from position pos to position match.start(), without the pointer having to go back to position 0 and then forward to match.start() for every match.

By contrast, with f.seek(match.start(), 0) the pointer is first moved back to position 0 (the beginning of the file) and then moved forward by match.start() bytes to reach the right position, because seek(... , 0) means that the position is measured from the beginning of the file, while seek(... , 1) means that the move is made from the CURRENT position.

EDIT:

If you want to replace only the isolated 'ww' strings and not the 'ww' chunks in longer strings such as 'wwwwwww', the regex must be:

pat = re.compile('(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')

This is something regexes make possible and that cannot be obtained with replace() without tricky string manipulations.
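To see the difference on an in-memory string (the sample text is invented for the illustration, and re.sub() is used here instead of the in-place file writes), compare the lookaround regex with chained replace() calls:

```python
import re

pat = re.compile(r'(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')
dicrepl = {'ww': 'vv', '..': '--'}

sample = 'ww,wwww,a..b,...'   # made-up test string

# Only isolated 'ww' and '..' are replaced.
isolated_only = pat.sub(lambda m: dicrepl[m.group()], sample)
# -> 'vv,wwww,a--b,...'

# replace() also touches the chunks inside longer runs.
every_occurrence = sample.replace('ww', 'vv').replace('..', '--')
# -> 'vv,vvvv,a--b,--.'
```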

EDIT:

I had forgotten the f.seek(0,0) instruction after f.read(). This instruction is necessary to move the file's pointer back to the beginning of the file, because reading moves the pointer to the end.

I have corrected the code and now it works.

Here is some code to follow what's being processed:

from os import listdir
from os.path import join
import re
pat = re.compile('(?<!w)ww(?!w)|(?<!\.)\.\.(?!\.)')
dicrepl = {'ww':'vv' , '..':'ZZ'}

path = ...................................

with open(path,'rb+') as f:
    print "file has just been opened, file's pointer is at position ",f.tell()
    print '- reading of the file : ch = f.read()'
    ch = f.read()
    print "file has just been read"+\
          "\nfile's pointer is now at position ",f.tell(),' , the end of the file'
    print "- file's pointer is moved back to the beginning of the file : f.seek(0,0)"
    f.seek(0,0)
    print "file's pointer is now again at position ",f.tell()
    pos = 0
    print '\n- process of replacement is now launched :'
    for match in pat.finditer(ch):
        print
        print 'is at position ',f.tell()
        print 'group ',match.group(),' detected on span ',match.span()
        f.seek(match.start()-pos, 1)
        print 'pointer has been moved to position ',f.tell()
        f.write(dicrepl[match.group()])
        print 'detected group has been replaced with ',dicrepl[match.group()]
        print 'now at position ',f.tell()
        pos = match.end()

Upvotes: 1
