coredumped0x

Reputation: 858

Python: How do I split a .txt file into two or more files with the same number of lines in each?

(I have been looking for hours on Stack Exchange and the rest of the internet, but couldn't find the right answer.)

What I'm trying to do here is count the number of lines a file has. I achieved that with this code:

# Does not load the file into memory
def file_len(fname):
    with open(fname) as f:
        i = 0
        for i, _ in enumerate(f, 1):
            pass
    return i

print(file_len('bigdata.txt'))

Then I take the number of lines in the file and divide it by two/three/etc. (to make two/three/etc. files with the same number of lines). E.g. if bigdata.txt has 1000000 lines, then 1000000 / 2 = 500000, so I would end up with two files of 500000 lines each: one covering lines 1 to 500000, the other lines 500001 to 1000000. I already have this code, which looks for a pattern in the original file (bigdata.txt), but I'm not looking for any pattern here; I just want to split the thing into two halves or so. Here is the code for it:

# Does not load the file into memory
with open('bigdata.txt', 'r') as r:
    with open('fhalf.txt', 'w') as f:
        for line in r:
            # Splits the file at the first occurrence of the pattern.
            # But as you may notice, the matching line itself ends up in
            # neither of the two files, which is not a good thing since
            # I need all the data.
            if line == 'pattern\n':
                break
            f.write(line)
    with open('shalf.txt', 'w') as f:
        for line in r:
            f.write(line)

So I'm looking for a simple solution, and I know there is one; I just can't figure it out at the moment. A sample result would be: file1.txt and file2.txt, each with the same number of lines, give or take one. Thank you all for your time.

Upvotes: 1

Views: 76

Answers (1)

Joe Iddon

Reputation: 20434

Read all the lines into a list with .readlines(), calculate how many lines should go to each file, and then get writing!

num_files = 2
with open('bigdata.txt') as in_file:
    lines = in_file.readlines()
    # Integer division: any remainder lines are not written.
    lines_per_file = len(lines) // num_files
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            # Write this file's slice of the line list.
            for i in range(n * lines_per_file, (n+1) * lines_per_file):
                out_file.write(lines[i])

And a full test:

$ cat bigdata.txt 
line1
line2
line3
line4
line5
line6
$ python -q
>>> num_files = 2
>>> with open('bigdata.txt') as in_file:
...     lines = in_file.readlines()
...     lines_per_file = len(lines) // num_files
...     for n in range(num_files):
...         with open('file{}.txt'.format(n+1), 'w') as out_file:
...             for i in range(n * lines_per_file, (n+1) * lines_per_file):
...                 out_file.write(lines[i])
... 
>>> 
$ more file*
::::::::::::::
file1.txt
::::::::::::::
line1
line2
line3
::::::::::::::
file2.txt
::::::::::::::
line4
line5
line6
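One caveat, not covered in the code above: the integer division silently drops any leftover lines when the total isn't an exact multiple of num_files. A minimal sketch of one way to spread the remainder over the first files, so the outputs differ by at most one line:

num_files = 2
with open('bigdata.txt') as in_file:
    lines = in_file.readlines()

# divmod gives the base lines-per-file and the leftover count.
base, extra = divmod(len(lines), num_files)
start = 0
for n in range(num_files):
    # The first `extra` files each take one additional line.
    stop = start + base + (1 if n < extra else 0)
    with open('file{}.txt'.format(n+1), 'w') as out_file:
        out_file.writelines(lines[start:stop])
    start = stop

With 7 lines and num_files = 2, divmod gives base 3 and extra 1, so file1.txt gets 4 lines and file2.txt gets 3.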

If you can't read bigdata.txt into memory, then the .readlines() solution won't cut it. You will have to write the lines as you read them, which is no big deal.

As for working out the length in the first place, this question discusses some methods; my favourite is Kyle's sum() method.

num_files = 2
num_lines = sum(1 for line in open('bigdata.txt'))  # count lines without keeping them
lines_per_file = num_lines // num_files             # remainder lines are not written
with open('bigdata.txt') as in_file:
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            for _ in range(lines_per_file):
                out_file.write(in_file.readline())
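If the file is too big for memory and you also want to keep any remainder lines, one option (a sketch, not part of the answer above; itertools.islice is in the standard library) is to let the last file take whatever is left:

from itertools import islice

num_files = 2
with open('bigdata.txt') as f:
    num_lines = sum(1 for _ in f)  # first pass: count lines
lines_per_file = num_lines // num_files

with open('bigdata.txt') as in_file:
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            if n == num_files - 1:
                # Last file takes everything that remains,
                # including any remainder lines.
                out_file.writelines(in_file)
            else:
                # islice pulls exactly lines_per_file lines
                # from the open handle without rewinding.
                out_file.writelines(islice(in_file, lines_per_file))

islice consumes lines from the already-open handle, so each file picks up exactly where the previous one stopped.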

Upvotes: 2
