Reputation: 3014
I have been trying to do some text manipulation in Python and am running into a lot of issues, mainly due to a fundamental misunderstanding of how file manipulation works in Python, so I am hoping to clear that up.
So let's say I'm iterating through a text file called "my.txt" and it has the following contents:
3 10 7 8
2 9 8 3
4 1 4 2
The code I'm using to iterate through the file is:
file = open("my.txt", 'r')
for line in file:
    print line
I copied and pasted the above code from a tutorial. I know what it does but I don't know why it works and it's bothering me. I am trying to understand exactly what the variable "line" represents in the file. Is it a data type (a string?) or something else? My instinct tells me that each line represents a string which could then be manipulated (which is what I want), but I also understand that strings are immutable in Python.
What role does memory play in all this? If my file is too big to fit into memory, will it still work? Will line[3] allow me to access the fourth element in each line? If I only want to work on the second line can I do:
if line == 2:
within the for loop?
It might be worth noting that I am pretty new to Python and am coming from a C/C++ background (not used to immutable strings). I know I squeezed quite a few questions into one but any clarification on the general topic would really be helpful :)
Upvotes: 1
Views: 437
Reputation: 104014
Suppose you have your same file:
3 10 7 8\n
2 9 8 3\n
4 1 4 2\n
There are many file methods that operate on a file object.
In Python, you can read a file character by character, C style:
with open('/tmp/test.txt', 'r') as fin:  # fin is a 'file object'
    while True:
        ch = fin.read(1)
        if not ch:
            break
        print ch,  # trailing comma suppresses the newline
You can read the whole file as a single string:
with open('/tmp/test.txt', 'r') as fin:
    data = fin.read()
    print data
As enumerated lines:
with open('/tmp/test.txt', 'r') as fin:
    for i, line in enumerate(fin):
        print i, line
As a list of strings:
with open('/tmp/test.txt', 'r') as fin:
    data = fin.readlines()
The idiom of looping over a file object:
for line in fin:  # 'fin' is a file object, the result of open
    print line
yields the same lines as:
for line in fin.readlines():  # but readlines() reads the whole file into a list first
    print line
and is similar to:
for line in 'line 1\nline 2\nline 3'.splitlines():
    print line
Once you get used to the Python style loops (or Perl, or Obj C, or Java range style loops) that loop over the elements of something -- you use them without thinking about it much.
If you want the index of each item -- use enumerate.
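To see how these file methods relate, here is a small sketch. It assumes Python 3 (hence print() calls) and uses io.StringIO as a stand-in for a real file so it runs without touching disk; the variable names are made up for illustration.

```python
import io

SAMPLE = "3 10 7 8\n2 9 8 3\n4 1 4 2\n"  # same contents as the example file

# Iterating the file object yields one line per step, newline included.
lines_by_iteration = [line for line in io.StringIO(SAMPLE)]

# readlines() builds the same list, but reads everything up front.
lines_by_readlines = io.StringIO(SAMPLE).readlines()

# readline() (singular) returns only the next line as one string,
# so looping over its result walks characters, not lines.
first_line = io.StringIO(SAMPLE).readline()
chars_of_first_line = [ch for ch in first_line]

# splitlines() on a string drops the trailing newlines.
lines_by_split = SAMPLE.splitlines()

print(lines_by_iteration == lines_by_readlines)
print(repr(first_line))
print(chars_of_first_line[:3])
print(lines_by_split)
```

The practical difference is memory: plain iteration holds one line at a time, while readlines() and read() hold the whole file.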
Upvotes: 1
Reputation: 17510
You can iterate over a file of any size, with the code you have shown, and it should not consume any significant amount of memory beyond the size of the longest single line.
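As a rough illustration of line-at-a-time processing, the sketch below sums the numbers on each line without ever holding more than one line of text. It assumes Python 3; row_sums is a hypothetical helper, and io.StringIO stands in for open("my.txt").

```python
import io

def row_sums(lines):
    """Yield the sum of the integers on each line, one line at a time."""
    for line in lines:
        numbers = [int(token) for token in line.split()]
        yield sum(numbers)

# io.StringIO stands in for open("my.txt"); a real file object iterates the same way.
fake_file = io.StringIO("3 10 7 8\n2 9 8 3\n4 1 4 2\n")
sums = list(row_sums(fake_file))
print(sums)  # [28, 22, 11]
```

Only the small per-line results are collected here; the lines themselves are discarded as soon as they have been processed.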
As for how it works, under the hood, you could dive into the source code for Python itself to learn the gory details. At a higher level just consider that the implementor of file objects, in Python, chose to implement line-by-line iteration as a feature of their class.
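At that higher level, a for loop over a file can be sketched as explicit calls to the built-ins iter() and next(). This is only an approximation of what the interpreter does, written for Python 3 with io.StringIO standing in for a real file.

```python
import io

fake_file = io.StringIO("3 10 7 8\n2 9 8 3\n4 1 4 2\n")

iterator = iter(fake_file)      # what the for statement asks for first
collected = []
while True:
    try:
        line = next(iterator)   # fetch the next line
    except StopIteration:       # raised once the file is exhausted
        break
    collected.append(line)

print(collected)
```

Any object that supports iter() and next() in this way can be used in a for loop, which is why the same loop syntax works for files, lists, and strings.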
Many of the collection data types and I/O interfaces in Python implement some form of iteration. Thus the for construct is the most common type of looping in Python. You can iterate over lists, tuples, and sets (by item), strings (by character), and dictionaries (by key); many classes (including those in the standard libraries as well as those from third parties) implement the "iterator (coding) protocol" to facilitate such usage.
Upvotes: 0
Reputation: 11060
In Python, you can iterate straight over a file. The best way of doing this is with a with statement, as in:
with open("myfile.txt") as f:
    for i in f:
        pass  # do stuff to each line in the file
The lines are strings, one for each line in the file (separated by newlines; each string keeps its trailing newline). If you only want to operate on the second line, you could do something like this:
with open("myfile.txt") as f:
    list_of_file = list(f)
second_line = list_of_file[1]  # indices start at 0, so [1] is the second line
If you then want to access part of the second line you can split it by spaces into another list like so:
second_number_in_second_line = second_line.split()[1]
With regards to memory, iterating through the file directly does not read it all into memory; turning it into a list, however, does. If you want to access individual lines without doing so, use itertools.islice.
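A minimal sketch of the islice approach, assuming Python 3 and using io.StringIO in place of open("myfile.txt"); the variable names are illustrative only.

```python
import io
from itertools import islice

fake_file = io.StringIO("3 10 7 8\n2 9 8 3\n4 1 4 2\n")

# islice(iterable, start, stop): start=1, stop=2 selects only the second line.
# next(..., None) returns None if the file has fewer than two lines.
second_line = next(islice(fake_file, 1, 2), None)

second_number = second_line.split()[1] if second_line is not None else None
print(repr(second_line))
print(second_number)
```

islice still has to read past the first line to reach the second, but it never builds a list of the whole file.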
Upvotes: 2
Reputation: 35901
In each iteration the line variable is filled with the contents of the next line read from the file. So, you'll have:
"3 10 7 8\n" in first iteration
"2 9 8 3\n" in second iteration
etc.
To get the numbers separately, use the string split method.
So comparing line with 2 doesn't make sense. If you want to identify line numbers, you can try:
lineNumber = 1  # count lines starting from 1
for line in file:
    print line
    if lineNumber == 2:
        print "that was the second line!"
    lineNumber += 1
As suggested in the comment, you can simplify this by using enumerate:
for lineNumber, line in enumerate(file, 1):  # start counting at 1
    print line
    if lineNumber == 2:
        print "that was the second line!"
Upvotes: 1
Reputation: 281476
line is a line of text, represented as a string. Strings are immutable, but that's not an issue for manipulating them; all variables in Python are references, and assigning to a variable points the reference to a new object. (In C++, you can't change where a reference points.) Iterating over a file iterates over the lines, so on each iteration, line refers to a new string representing the next line of the input file.
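A short sketch of what that means in practice, using the first line of the example file (Python 3 syntax assumed; the variable names are illustrative). Indexing a string gives a character, not a whitespace-separated field, and "changing" a line means building a new string and rebinding the name.

```python
line = "3 10 7 8\n"

fourth_char = line[3]      # indexing a string gives a single character
fields = line.split()      # split on whitespace into a list of substrings
fourth_field = fields[3]

# Strings cannot be changed in place, so build a new one and rebind the name.
fields[1] = "99"
line = " ".join(fields) + "\n"

print(repr(fourth_char))   # '0'
print(repr(fourth_field))  # '8'
print(repr(line))          # '3 99 7 8\n'
```

So line[3] would not give the fourth number; line.split()[3] would.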
If you're familiar with range-based for loops or other languages' for-each constructs, that's how Python's for works. The loop variable is not a counter; you can't do
if line == 2:
because line isn't the index of the line; it's the line itself. You could do
for i, line in enumerate(f):
    if i == 1:  # enumerate counts from 0, so 1 is the second line
        do_stuff_with(line)
        break  # No need to load the rest of the file
Note that file is the name of a builtin, so it's a bad idea to use that name for your own variables.
Upvotes: 3