Reputation: 21453
How to copy all even lines from one file to a new file in Python?
Even lines are just an illustration: in practice I want to copy a select but substantial number of lines from one file to another. It should serve well as an example, though.
I use this, but it is very inefficient (it takes around 5 minutes):
# foo.txt holds 200,000 lines with 300 values
list = [0, 2, 4, 6, 8, 10..... 199996, 199998]
newfile = open(savefile, "w")
with open("foo.txt", "r") as file:
    for i, line in enumerate(file):
        if i in list:
            newfile.write(line)
newfile.close()
I would also appreciate it if there's an explanation why this is so slow: reading line by line goes quickly (around 15 seconds), and is also advised by the manual.
EDIT: My apologies; I am not looking for specific odd/even examples; it is merely for the effect of how to deal with around 100k out of 200k values in no easy order. Is there no general solution to the I/O problem here other than finding more efficient ways to deal with odd/even? Again apologies for bringing it up.
Upvotes: 1
Views: 2507
Reputation: 17168
You're spending tons of time creating and then repeatedly searching (on every line!) that monstrous list. Just read the first file line by line and skip every other line. You can either do this with a toggling flag, or just check if the line number is divisible by two (clearer, in my opinion).
for i, line in enumerate(file):
    if i % 2 == 0:
        newfile.write(line)
EDIT in response to your edit: your question is now "how to copy arbitrary lines from a file?" That depends an awful lot on how those arbitrary lines are defined. The answer still is definitely not to use a list of "wanted" line numbers, because searching that list takes a long time, and you'll have to search it on every line.
If the goal is essentially to be able to pick random lines from the file, you could use something similar to your current setup, but with a set instead of a list to make your lookup fast. A general-case proof-of-concept solution might look like this:
import random
# Pick 5000 random lines
wanted_lines = set(random.sample(range(200000), 5000))  # Use a set!
for i, line in enumerate(file):
    if i in wanted_lines:  # average-case O(1)
        newfile.write(line)  # line already ends with its own newline
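To get a feel for how much the container type matters, here is a rough timing sketch. The names (`wanted_list`, `wanted_set`, `count_hits`) and the size `n` are made up for the demo, and `n` is kept far smaller than the 200,000 lines in the question so the list version finishes quickly. Actual timings will vary by machine.

```python
import timeit

n = 4000  # assumed demo size, much smaller than the real file
wanted_list = list(range(0, n, 2))  # every even index, as in the question
wanted_set = set(wanted_list)       # same indices, hash-based lookup


def count_hits(container):
    # Performs the same membership test as the copy loop, once per index
    return sum(1 for i in range(n) if i in container)


list_seconds = timeit.timeit(lambda: count_hits(wanted_list), number=1)
set_seconds = timeit.timeit(lambda: count_hits(wanted_set), number=1)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

Both containers select exactly the same indices; only the cost of each `in` test differs, and the gap grows as the list gets longer.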
Upvotes: 1
Reputation: 4648
I'm assuming that your list is predefined and can contain any sequence of line indices, not necessarily every Nth line, for example.
The first probable bottleneck is that you're doing an O(n) list search (i in list) 200000 times. Converting the list to a dictionary should already help:
listd = dict.fromkeys(list)
.
.
# this is O(1) instead of O(n)
if i in listd:
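As a small illustration of what that conversion produces (the `wanted` list here is a made-up stand-in for the real index list): `dict.fromkeys` maps every index to `None`, and only the keys matter for the lookup. A `set` would serve the same purpose.

```python
wanted = [0, 2, 4, 6]  # hypothetical short index list

as_dict = dict.fromkeys(wanted)  # keys are the indices, values are all None
as_set = set(wanted)             # equivalent for membership tests

print(as_dict)
print(4 in as_dict, 3 in as_dict)
print(4 in as_set, 3 in as_set)
```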
Alternatively, if you know that list is sorted, or you can sort it, simply keep track of the next line index:
list = [0, 2, 4, 6, 8, 10..... 199996, 199998]
nextidx = 0
newfile = open(savefile, "w")
with open("foo.txt", "r") as file:
    for i, line in enumerate(file):
        # the bounds check avoids an IndexError once the last wanted line is written
        if nextidx < len(list) and i == list[nextidx]:
            newfile.write(line)
            nextidx += 1
newfile.close()
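A variation on the sorted approach, sketched with in-memory files so it can be run as-is. The function name `copy_sorted_lines` is hypothetical, and it assumes the indices are 0-based, ascending, and free of duplicates. It walks the index list with an iterator and stops reading as soon as nothing is left to copy.

```python
import io


def copy_sorted_lines(src, dst, wanted_sorted):
    """Copy lines whose 0-based index appears in wanted_sorted (ascending, unique)."""
    wanted = iter(wanted_sorted)
    target = next(wanted, None)  # None means no more lines are wanted
    for i, line in enumerate(src):
        if target is None:
            break  # all wanted lines written, skip the rest of the file
        if i == target:
            dst.write(line)
            target = next(wanted, None)


src = io.StringIO("a\nb\nc\nd\ne\n")
dst = io.StringIO()
copy_sorted_lines(src, dst, [0, 2, 3])
print(repr(dst.getvalue()))
```

The early `break` only pays off when the wanted indices end well before the end of the file; otherwise this behaves like the loop above.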
Upvotes: 1
Reputation: 11737
Something like this?
flag = False
with open("test_async_db_access.py", "r") as file:
    for line in file:
        if flag:
            print(line, end="")
        flag = not flag
This avoids having to use the large list.
Edit: If you want an arbitrary list of lines, use a hash-based container ({}), as in DSM's answer; the 'in' test then runs in O(1) time instead of O(n).
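For the plain every-other-line case, `itertools.islice` can replace the toggling flag entirely. This is only a sketch using in-memory files (`src` and `dst` are stand-ins for real file objects); it starts at index 0, so change the start argument to 1 to get the other half of the lines.

```python
import io
from itertools import islice

src = io.StringIO("line0\nline1\nline2\nline3\nline4\n")
dst = io.StringIO()

# islice(iterable, start, stop, step): begin at index 0, no stop, every 2nd line
dst.writelines(islice(src, 0, None, 2))
print(repr(dst.getvalue()))
```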
Upvotes: 0
Reputation: 353079
What's taking all the time is searching list. In order to figure out whether i is in list, it has to scan through the entire list to be sure that it's not there. If you really only care about even numbers, you can simply use if i % 2 == 0, but if you have a specific group of line numbers you want, you should use a set, which has O(1) membership testing, e.g.
keep = {1, 5, 888, 20203}
and then
if i in keep:
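Put together, a minimal end-to-end sketch might look like the following. The function name `copy_selected_lines` and the temporary file names are assumptions for the demo, not anything from the question. It accepts any iterable of 0-based line numbers and converts it to a set once, up front.

```python
import os
import tempfile


def copy_selected_lines(src_path, dst_path, keep):
    """Copy the lines of src_path whose 0-based index is in keep to dst_path."""
    keep = set(keep)  # one-time conversion; lookups are O(1) on average
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            if i in keep:
                dst.write(line)


# Small demo with a throwaway 10-line file
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "foo.txt")
    dst_path = os.path.join(tmp, "out.txt")
    with open(src_path, "w") as f:
        f.writelines(f"row {n}\n" for n in range(10))
    copy_selected_lines(src_path, dst_path, [1, 5, 8])
    with open(dst_path) as f:
        result = f.read()

print(result)
```

Opening both files in a single `with` statement also ensures the output file is closed even if an error occurs partway through.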
Upvotes: 3