Reputation: 7273
I have a file with 50 GB of data. I know how to use Pandas for my data analysis.
I only need the last 1000 lines (rows), not the complete 50 GB. Hence, I thought of using the nrows option in read_csv().
I have written the code like this:
import pandas as pd
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=1000,index_col=0)
But it takes the top 1000 rows, and I need the last 1000 rows. So I did this and received an error:
df = pd.read_csv("Analysis_of_50GB.csv",encoding="utf-16",nrows=-1000,index_col=0)
ValueError: 'nrows' must be an integer >=0
I have even tried using the chunksize option in read_csv(). But it still loads the complete file, and the output was not a DataFrame but an iterable.
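For example, this is roughly what I tried with chunksize (the chunk size here is arbitrary):

import pandas as pd

reader = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=100000, index_col=0)
print(type(reader))   # a TextFileReader iterator over DataFrames, not a DataFrame
for chunk in reader:
    pass              # iterating still has to go through every row of the file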
Hence, please let me know what I can do in this scenario.
Please NOTE THAT I DO NOT WANT TO OPEN THE COMPLETE FILE...
Upvotes: 1
Views: 7099
Reputation: 148965
The normal way would be to read the whole file and keep the last 1000 lines in a deque, as suggested in the accepted answer to Efficiently Read last 'n' rows of CSV into DataFrame.
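For reference, a minimal sketch of that deque approach (the file name and index_col follow the question; untested on a real 50 GB file):

import collections
import io
import pandas as pd

with open("Analysis_of_50GB.csv", "r", encoding="utf-16") as f:
    header = next(f)                          # keep the header line
    tail = collections.deque(f, maxlen=1000)  # only the last 1000 lines survive

df = pd.read_csv(io.StringIO(header + "".join(tail)), index_col=0)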
But that may be suboptimal for a really huge file of 50 GB, because it still streams everything through Python. In that case I would try a simple pre-processing:
- read and throw away a bit more than 1000 lines (say 1250), and use tell() to estimate how many bytes they occupy;
- seek that many bytes back from the end of the file and read the tail into a buffer;
- record the positions of the newlines in the buffer in a deque of maximum length 1001; if fewer than 1001 newlines are found, the offset was too small, so grow it proportionally and retry;
- finally, load the dataframe with the 1000 lines contained in the buffer: df = pd.read_csv(io.StringIO(buffer[deq[0]+1:]))
Code could be (beware: untested):
import collections
import io
import itertools
import os
import pandas as pd

N = 1000                                   # number of rows wanted from the end

# Text-mode files cannot seek relative to the end, so work in binary and
# decode by hand (this assumes little-endian UTF-16, i.e. a file with a BOM).
with open("Analysis_of_50GB.csv", "rb") as fd:
    for _ in itertools.islice(fd, 1250):   # sample a bit more than N lines...
        pass
    offset = fd.tell()                     # ...to estimate their size in bytes
    size = fd.seek(0, os.SEEK_END)
    while True:
        pos = max(size - offset, 2)        # never seek back past the BOM
        fd.seek(pos - pos % 2)             # stay aligned on 2-byte code units
        buffer = fd.read().decode("utf-16-le")
        deq = collections.deque(maxlen=N + 1)
        for i, c in enumerate(buffer):
            if c == '\n':
                deq.append(i)              # positions of the last N+1 newlines
        if len(deq) == N + 1 or pos == 2:
            break                          # the buffer holds at least N full lines
        offset = offset * 1250 // max(len(deq), 1)  # widen the window and retry

# everything after the oldest remembered newline is the last N lines
df = pd.read_csv(io.StringIO(buffer[deq[0] + 1:]), header=None)
Upvotes: 1
Reputation: 2472
I think you need to use skiprows and nrows together. Assuming that your file has 1000 rows, then,
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=lambda x: 0 < x <= 900, nrows=1000-900, index_col=0)
reads all the rows from 901 to 1000.
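More generally, if the total number of data rows n were known in advance, the same pattern would keep just the last k rows (a sketch; n and k are placeholder values you would have to supply):

import pandas as pd

n, k = 1000, 100  # assumed total number of data rows, and rows wanted from the end
df = pd.read_csv("Analysis_of_50GB.csv",
                 encoding="utf-16",
                 skiprows=lambda x: 0 < x <= n - k,  # keep the header (row 0) and the last k rows
                 index_col=0)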
Upvotes: 1
Reputation: 394051
A pure pandas method:
import pandas as pd
line = 0
chksz = 1000
for chunk in pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", chunksize=chksz, usecols=[0]):
    line += chunk.shape[0]
So this just counts the number of rows; we read only the first column for performance reasons.
Once we have the total number of rows, we skip everything between the header and the last 1000 rows:
df = pd.read_csv("Analysis_of_50GB.csv", encoding="utf-16", skiprows=range(1, line - 1000 + 1), index_col=0)
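Note that this makes two passes over the 50 GB file, one to count the rows and one to read the tail, but memory use stays bounded by the chunk size.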
Upvotes: 3