Reputation: 259
I want to read only a certain amount of rows starting from a certain row in a csv file without iterating over the whole csv file to reach this certain point.
Lets say i have a csv file with 100 rows and i want to read only row 50 to 60. I dont want to iterate from row 1 to 49 to reach row 50 to start reading. Can i somehow achieve this with seek()?
For example: Seek to row 50 read from 50 to 60
next time: seek to row 27 read 27 to 34 and so on
So not only seeking continuesly forward through the file but also backwards.
Thank you a lot
Upvotes: 7
Views: 43232
Reputation: 45251
If the # of columns/line lengths are variable, it isn't possible to find the line you want without "reading" (ie, processing) every character of the file that comes before that, and counting the line terminators. And the fastest way to process them in python, is to use iteration.
As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:
with open(file_name) as f:
for line,_ in zip(f, range(50)):
pass
lines = [line for line,_ in zip(f, range(10))]
...or to read a character at a time using seek
, and count new line characters. But it is certainly MUCH more convenient to do the first.
However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict
of the line lengths ahead of time:
from itertools import accumulate
with open(file_name) as f:
cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))
This would allow you to seek to any line number in the file without processing the whole thing ever again:
def seek_line(path, line_num, cum_lens):
with open(path) as f:
f.seek(cum_lens[line_num], 0)
return f.readline()
class LineX:
"""A file reading object that can quickly obtain any line number."""
def __init__(self, path, cum_lens):
self.cum_lens = cum_lens
self.path = path
def __getitem__(self, i):
return seek_line(self.path, i, self.cum_lens)
linex = LineX(file_name, cum_lens)
line50 = linex[50]
But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
Upvotes: 4
Reputation: 224
You can use chunksize
import pandas as pd
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
process(chunk)
Upvotes: 5
Reputation: 45
its that easy:
with open("file.csv", "r") as file:
print(file.readlines()[50:60])
Upvotes: 1
Reputation: 6376
As others are saying the most obvious solution is to use pandas read csv ! The method has a parameter called skiprows:
from the doc there is what is said :
skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
You can have something like this :
import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows =lambda x: x not in range(50, 60))
Since you specify that the memory is your problem you can use the chunksize parameter as said in this tutorial
he said :
The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
You can try this and iterate over the chunk to retrieve only the rows you are looking for.
The function should return true if the row number is in the specified list
Upvotes: 3
Reputation: 482
An option would be to use Pandas. For example:
import pandas as pd
# Select file
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows = 50, nrows=10)
Upvotes: 8