Jaffer Wilson

Reputation: 7273

Reading last N rows of a large csv in Pandas

I have a file with 50 GB of data. I know how to use Pandas for my data analysis.
I only need the last 1000 lines or rows, not the complete 50 GB.
Hence, I thought of using the nrows option in the read_csv().
I have written the code like this:

import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)

But this takes the top 1000 rows. I need the last 1000 rows, so I did this and received an error:

df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0

I have even tried using chunksize in read_csv(), but it still reads through the complete file, and the output is not a DataFrame but an iterable of chunks.

Hence, please let me know what I can do in this scenario.

Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...

Upvotes: 1

Views: 7099

Answers (4)

Serge Ballesta

Reputation: 148965

The normal way would be to read the whole file and keep the last 1000 lines in a deque, as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame. But that may be suboptimal for a really huge file of 50 GB.
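
For reference, a minimal sketch of that deque-based approach (it still has to stream through the whole 50 GB once):

import io
from collections import deque
import pandas as pd

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
    header = fd.readline()               # keep the header line
    last_lines = deque(fd, maxlen=1000)  # stream the file, keep only the last 1000 lines

df = pd.read_csv(io.StringIO(header + "".join(last_lines)), index_col=0)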

In that case I would try a simple pre-processing:

  • open the file
  • read and discard 1000 lines
  • use tell() to get an approximation of how much has been read so far
  • seek back that many bytes from the end of the file and read the end of the file into a large buffer (if you have enough memory)
  • store the positions of the '\n' characters in the buffer in a deque of size 1001 (the file probably ends with a '\n'); let us call it deq
  • ensure that you have 1001 newlines, else iterate with a larger offset
  • load the dataframe with the 1000 lines contained in the buffer:

    df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))
    

Code could be (beware: untested):

import collections
import io
import itertools
import os

import pandas as pd

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as fd:
    for i in itertools.islice(fd, 1250):      # read and discard a bit more than 1000 lines...
        pass
    offset = fd.tell()                        # ...to estimate how many bytes they take
    while True:
        # note: a text-mode file may refuse end-relative seeks; if so, open the
        # file in binary mode and decode the buffer to text before searching it
        fd.seek(-offset, os.SEEK_END)
        deq = collections.deque(maxlen=1001)  # positions of the last 1001 '\n' characters
        buffer = fd.read()
        for i, c in enumerate(buffer):
            if c == '\n':
                deq.append(i)
        if len(deq) == 1001:                  # enough newlines: the last 1000 lines are in the buffer
            break
        offset = offset * 1250 // len(deq)    # not enough: scale the offset up and try again

df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))

Upvotes: 1

Loochie

Reputation: 2472

I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,

df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=lambda x: 0 < x <= 900, nrows=1000-900, index_col=0)

reads all the rows from 901 to 1000.

Upvotes: 1

EdChum

Reputation: 394051

A pure pandas method:

import pandas as pd
line = 0
chksz = 1000
for chunk in pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=chksz, index_col=0, usecols=[0]):
    line += chunk.shape[0]

So this just counts the number of rows; we read just the first column for performance reasons.

Once we have the total number of rows, we just skip everything except the header and the last 1000 rows:

df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=range(1, line - 1000 + 1), index_col=0)

Upvotes: 3

Auss

Reputation: 491

You should consider using dask, which does chunking under the hood and allows you to work with very large data frames. It has a workflow very similar to pandas, and the most important functions are already implemented.
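
A minimal sketch of that workflow (assuming, for illustration, a splittable encoding such as UTF-8; dask cannot reliably split a UTF-16 file into byte blocks):

import dask.dataframe as dd

# dask builds a lazy, partitioned dataframe; only the partitions needed
# for .tail() are actually read and parsed
ddf = dd.read_csv("Analysis_of_50GB.csv")
last_1000 = ddf.tail(1000)   # a plain pandas DataFrame with the last 1000 rows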

Upvotes: 1
