Andreanna

Reputation: 245

Using python to write specific lines from one file to another file

I have ~200 short text files (about 50 KB each) that all have a similar format. I want to find a line in each of those files that contains a certain string and then write that line plus the next three lines (but not the rest of the lines in the file) to another text file. I am trying to teach myself Python in order to do this and have written a very simple and crude little script to try this out. I am using version 2.6.5, and running the script from the Mac terminal:

#!/usr/bin/env python

f = open('Test.txt')

Lines=f.readlines()
searchquery = 'am\n'
i=0

while i < 500:
    if Lines[i] == searchquery:
        print Lines[i:i+3]
        i = i+1
    else:
        i = i+1
f.close()

This more or less works and prints the output to the screen. But I would like to print the lines to a new file instead, so I tried something like this:

f1 = open('Test.txt')
f2 = open('Output.txt', 'a')

Lines=f1.readlines()
searchquery = 'am\n'
i=0

while i < 500:
if Lines[i] == searchquery:
    f2.write(Lines[i])
    f2.write(Lines[i+1])
    f2.write(Lines[i+2])
    i = i+1
else:
    i = i+1
f1.close()
f2.close()

However, nothing is written to the file. I also tried

from __future__ import print_function
print(Lines[i], file='Output.txt')

and can't get that to work, either. If anyone can explain what I'm doing wrong or offer some suggestions about what I should try instead I would be really grateful. Also, if you have any suggestions for making the search better I would appreciate those as well. I have been using a test file where the string I want to find is the only text on the line, but in my real files the string that I need is still at the beginning of the line but followed by a bunch of other text, so I think the way I have things set up now won't really work, either.

Thanks, and sorry if this is a super basic question!

Upvotes: 13

Views: 88996

Answers (5)

computerist

Reputation: 942

Writing line by line can be slow when working with large data. You can accelerate the read/write operations by reading/writing a bunch of lines at once.

from itertools import islice

f1 = open('Test.txt')
f2 = open('Output.txt', 'a')

bunch = 500
lines = list(islice(f1, bunch)) 
f2.writelines(lines)

f1.close()
f2.close()

In case your lines are too long and depending on your system, you may not be able to put 500 lines in a list. If that's the case, you should reduce the bunch size and have as many read/write steps as needed to write the whole thing.
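A minimal sketch of what those repeated read/write steps could look like, assuming the goal is to copy the whole file in bunches. The function name copy_in_bunches and its bunch parameter are made up for this illustration, and the code should also run on Python 3:

```python
from itertools import islice


def copy_in_bunches(src_path, dst_path, bunch=500):
    """Append src_path to dst_path, reading at most `bunch` lines at a time."""
    total = 0
    with open(src_path) as src:
        with open(dst_path, 'a') as dst:
            while True:
                # islice pulls up to `bunch` lines without loading the rest.
                lines = list(islice(src, bunch))
                if not lines:
                    break  # nothing left to read
                dst.writelines(lines)
                total += len(lines)
    return total
```

Calling copy_in_bunches('Test.txt', 'Output.txt', bunch=100) would then copy the file 100 lines at a time and return the number of lines written.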

Upvotes: 0

Lukas Graf

Reputation: 32590

As pointed out by @ajon, I don't think there's anything fundamentally wrong with your code except the indentation. With the indentation fixed it works for me. However, there are a couple of opportunities for improvement.

1) In Python, the standard way of iterating over things is by using a for loop. When using a for loop, you don't need to define loop counter variables and keep track of them yourself in order to iterate over things. Instead, you write something like this

for line in lines:
    print line

to iterate over all the items in a list of strings and print them.

2) In most cases this is what your for loops will look like. However, there are situations where you actually do want to keep track of the loop count. Your case is such a situation, because you not only need that one line but also the lines that follow it, and therefore need to use the counter for indexing (lst[i]). For that there's enumerate(), which yields each item together with its index, so you can loop over both at once.

for i, line in enumerate(lines):
    print i
    print line
    print lines[i+7]

If you were to manually keep track of the loop counter as in your example, there are two things to fix:

3) That i = i+1 should be moved out of the if and else blocks. You're doing it in both cases, so put it after the if/else. In your case the else block then doesn't do anything any more, and can be eliminated:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
    i = i+1

4) Now, this will cause an IndexError with files shorter than 500 lines. Instead of hard coding a loop count of 500, you should use the actual length of the sequence you're iterating over. len(lines) will give you that length. But instead of using a while loop, use a for loop and range(len(lst)) to iterate over a list of the range from zero to len(lst) - 1.

for i in range(len(lst)):
    print lst[i]

5) open() can be used as a context manager that takes care of closing files for you. Context managers are a rather advanced concept but are pretty simple to use if they're already provided for you. By doing something like this

with open('test.txt', 'w') as f:
    f.write('foo')

the file will be opened and accessible to you as f inside that with block. After you leave the block the file will be automatically closed, so you can't end up forgetting to close the file.
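A small sketch that makes the automatic closing visible, assuming a file opened for writing. The helper name write_and_report is hypothetical and only exists for this illustration:

```python
def write_and_report(path):
    """Write one line inside a with block; report f.closed inside and after it."""
    with open(path, 'w') as f:
        f.write('foo\n')
        closed_inside = f.closed  # the file is still open inside the block
    closed_after = f.closed       # the file is closed once the block is left
    return closed_inside, closed_after
```

The first value returned is False and the second is True, which is exactly the guarantee the with block gives you.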

In your case you're opening two files. This can be done by using two with statements and nesting them

with open('one.txt', 'w') as f1:
    with open('two.txt', 'a') as f2:
        f1.write('foo')
        f2.write('bar')

or, in Python 2.7 / Python 3.x, by combining two context managers in a single with statement:

with open('one.txt', 'w') as f1, open('two.txt', 'a') as f2:
    f1.write('foo')
    f2.write('bar')

6) Depending on the operating system the file was created on, line endings are different. On UNIX-like platforms it's \n, Macs before OS X used \r, and Windows uses \r\n. So the comparison Lines[i] == searchquery will not match for Mac or Windows line endings. file.readlines() can deal with all three, but because it keeps whatever line endings were there at the end of the line, the comparison will fail. This is solved by using str.strip(), which will strip the string of all whitespace at the beginning and the end, and comparing a search pattern without the line ending to that:

searchquery = 'am'
# ...
            if line.strip() == searchquery:
                # ...

(Reading the file using file.read() and using str.splitlines() would be another alternative.)
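A rough sketch of that splitlines() alternative, assuming the whole file content has already been read into one string (for example with f.read()). The function name find_exact_lines is made up for this illustration:

```python
def find_exact_lines(text, searchquery):
    """Return the indexes of lines equal to searchquery, ignoring line endings."""
    # splitlines() splits on \n, \r and \r\n alike and drops the line ending,
    # so the search pattern can be compared without a trailing newline.
    lines = text.splitlines()
    return [i for i, line in enumerate(lines) if line == searchquery]
```

Because the line endings are already gone, no strip() call is needed here, and leading or trailing spaces on a line are left alone.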

But, since you mentioned your search string actually appears at the beginning of the line, let's do that, by using str.startswith():

if line.startswith(searchquery):
    # ...

7) The official style guide for Python, PEP 8, recommends using CamelCase for classes and lowercase_underscore for pretty much everything else (variables, functions, attributes, methods, modules, packages). So instead of Lines use lines. This is definitely a minor point compared to the others, but still worth getting right early on.


So, considering all those things I would write your code like this:

searchquery = 'am'

with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
        lines = f1.readlines()
        for i, line in enumerate(lines):
            if line.startswith(searchquery):
                f2.write(line)
                f2.write(lines[i + 1])
                f2.write(lines[i + 2])

As @TomK pointed out, all this code assumes that if your search string matches, there are at least two lines following it. If you can't rely on that assumption, dealing with that case by using a try...except block like @poorsod suggested is the right way to go.
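One way such a try...except could look, as a sketch only. The function write_matches and its following parameter are hypothetical names, and out is assumed to be any object with a write() method, such as the open output file:

```python
def write_matches(lines, searchquery, out, following=2):
    """Write each matching line plus up to `following` lines after it to `out`."""
    written = 0
    for i, line in enumerate(lines):
        if line.startswith(searchquery):
            for offset in range(following + 1):
                try:
                    out.write(lines[i + offset])
                except IndexError:
                    break  # the file ended before all following lines existed
                written += 1
    return written
```

With the files opened as above, write_matches(f1.readlines(), 'am', f2) writes whatever lines exist after a match near the end of the file instead of crashing.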

Upvotes: 27

whardier

Reputation: 705

Have you tried using something other than 'Output.txt' to avoid any filesystem related issues as the problem?

What about an absolute path to avoid any funky unforeseen problems while diagnosing this.
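A quick way to see where a relative name like 'Output.txt' actually ends up, as a sketch. The helper name resolve_output_path is made up for this illustration:

```python
import os


def resolve_output_path(name='Output.txt'):
    """Return the absolute path that a relative file name would resolve to."""
    # A relative name is resolved against the current working directory,
    # which is not necessarily the directory the script lives in.
    return os.path.abspath(name)
```

Printing the result before opening the file shows which directory to look in for the output.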

This advice is simply from a diagnostic standpoint. Also check out the OS X tools dtrace and dtruss.

See: Equivalent of strace -feopen < command > on mac os X

Upvotes: 1

TomK

Reputation: 363

ajon has the right answer, but so long as you are looking for guidance, your solution doesn't take advantage of the high-level constructs that Python can offer. How about:

searchquery = 'am\n'

with open('Test.txt') as f1:
  with open('Output.txt', 'a') as f2:

    Lines = f1.readlines()

    try:
      i = Lines.index(searchquery)
      for iline in range(i, i+3):
        f2.write(Lines[iline])
    except ValueError:  # raised by index() when no line matches
      print "not in file"

The two "with" statements will automatically close the files at the end, even if an exception happens.

A still better solution would be to avoid reading in the whole file at once (who knows how big it could be?) and, instead, process line by line, using iteration on a file object:

  with open('Test.txt') as f1:
    with open('Output.txt', 'a') as f2:
      for line in f1:
        if line == searchquery:
          f2.write(line)
          f2.write(f1.next())
          f2.write(f1.next())

All of these assume that there are at least two additional lines beyond your target line.
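If that assumption might not hold, here is a sketch of the line-by-line version that tolerates a short tail. It uses itertools.islice instead of calling next() directly, and it matches with startswith() rather than == since the real files have more text after the search string. The function name copy_matches and its following parameter are hypothetical:

```python
from itertools import islice


def copy_matches(src, dst, searchquery, following=2):
    """Stream src line by line; write matching lines and what follows to dst."""
    lines = iter(src)  # one shared iterator for the loop and for islice
    for line in lines:
        if line.startswith(searchquery):
            dst.write(line)
            # islice stops quietly if fewer than `following` lines remain,
            # where a bare next() call would raise StopIteration.
            # The lines it consumes are not themselves checked for a match.
            dst.writelines(islice(lines, following))
```

With both files opened as above, copy_matches(f1, f2, 'am') never holds more than a few lines in memory at once.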

Upvotes: 1

alanmanderson

Reputation: 8200

I think your problem is the indentation in the second script.

You need to indent everything from if Lines[i] through the last i = i+1, like this:

while i < 500:
    if Lines[i] == searchquery:
        f2.write(Lines[i])
        f2.write(Lines[i+1])
        f2.write(Lines[i+2])
        i = i+1
    else:
        i = i+1

Upvotes: 3
