Reputation: 259

Only read certain rows in a csv file with python

I want to read only a certain amount of rows starting from a certain row in a csv file without iterating over the whole csv file to reach this certain point.

Lets say i have a csv file with 100 rows and i want to read only row 50 to 60. I dont want to iterate from row 1 to 49 to reach row 50 to start reading. Can i somehow achieve this with seek()?

For example: Seek to row 50 read from 50 to 60

next time: seek to row 27 read 27 to 34 and so on

So not only seeking continuesly forward through the file but also backwards.

Thank you a lot

Upvotes: 7

Answers (5)

Rick

Reputation: 45251

If the # of columns/line lengths are variable, it isn't possible to find the line you want without "reading" (ie, processing) every character of the file that comes before that, and counting the line terminators. And the fastest way to process them in python, is to use iteration.

As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:

with open(file_name) as f:
    for line,_ in zip(f, range(50)):
        pass
    lines = [line for line,_ in zip(f, range(10))]

...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.

However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict of the line lengths ahead of time:

from itertools import accumulate
with open(file_name) as f:
    cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))

This would allow you to seek to any line number in the file without processing the whole thing ever again:

def seek_line(path, line_num, cum_lens):
    with open(path) as f:
        f.seek(cum_lens[line_num], 0)
        return f.readline()

class LineX:
    """A file reading object that can quickly obtain any line number."""
    def __init__(self, path, cum_lens):
        self.cum_lens = cum_lens
        self.path = path
    def __getitem__(self, i):
        return seek_line(self.path, i, self.cum_lens)

linex = LineX(file_name, cum_lens)
line50 = linex[50]

But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.

Upvotes: 4

Demont Zhang

Reputation: 224

You can use chunksize

import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

Upvotes: 5

Phoenix Flight

Reputation: 45

its that easy:

with open("file.csv", "r") as file:
    print(file.readlines()[50:60])

Upvotes: 1

Espoir Murhabazi

Reputation: 6376

As others are saying the most obvious solution is to use pandas read csv ! The method has a parameter called skiprows:

from the doc there is what is said :

skiprows : list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

You can have something like this :

import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows =lambda x: x not in range(50, 60))

Since you specify that the memory is your problem you can use the chunksize parameter as said in this tutorial

he said :

The parameter essentially means the number of rows to be read into a dataframe at any single time in order to fit into the local memory. Since the data consists of more than 70 millions of rows, I specified the chunksize as 1 million rows each time that broke the large data set into many smaller pieces.

df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)

You can try this and iterate over the chunk to retrieve only the rows you are looking for.

The function should return true if the row number is in the specified list

Upvotes: 3

Dylan_w

Reputation: 482

An option would be to use Pandas. For example:

import pandas as pd
# Select file 
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows = 50, nrows=10)

Upvotes: 8

Only read certain rows in a csv file with python

Answers (5)

Related Questions