luisfer

Reputation: 2120

Python/Pandas: How can I read 7 million records?

I was given a "database" (more accurately, an ugly, huge CSV file) that contains the results of a "discovery" process. The rows are very short; they hold licensing information for over 65,000 computers and look like this:

10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation

As you can see, it is a semicolon-separated file. Each row has the time the process was run, the PC's ID, a blank field (I don't know what it is), the program name, and the program version. There are more fields, but I don't care about them; only those are relevant.

So I turned to Pandas to do some analysis (basically counting), and got around 3M records. The problem is that the file has over 7M records (I checked it with Notepad++ 64-bit). So, how can I use Pandas to analyze a file with so many records?
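
The kind of counting I mean is nothing fancy, roughly something like this (just a sketch; the column names are placeholders for the fields described above):

import pandas as pd

df = pd.read_csv("inventario.csv", delimiter=";", header=None,
                 usecols=[0, 1, 2, 3, 4], encoding="latin_1")
df.columns = ["fecha", "equipo", "blank", "software", "version"]

print(df["software"].value_counts().head(20))   # how many rows each program appears in
print(df.groupby("software")["equipo"].nunique().sort_values(ascending=False).head(20))   # distinct machines per program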

I'm using Python 3.5, Pandas 0.19.2

Adding info for Fabio's comment:

I'm using:

df = pd.read_csv("inventario.csv", delimiter=";", 
                 header=None, usecols=[0,1,2,3,4], 
                 encoding="latin_1")

To be very precise: the file has 7,432,175 rows, but Pandas only reads 3,172,197 of them. Curiously, if I load the file into Excel 2017 (using a data query) it loads exactly the same 3,172,197 rows.
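
(For the record, the raw line count can also be confirmed from Python; a sketch that counts physical lines without parsing them and compares against the dataframe loaded above:)

with open("inventario.csv", encoding="latin_1") as f:
    raw_lines = sum(1 for _ in f)   # physical lines in the file
print(raw_lines, len(df))           # ~7,432,175 raw lines vs. the ~3,172,197 rows read_csv returns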

EDIT: After the comments, I checked the file and found that some lines (around 450) were corrupted. I don't know if they were meant to signal an end of file; it doesn't look like it. In any case, I cleaned out the malformed lines, and Pandas still reads only around 3M lines.

EDIT:

OK, I solved the problem, but really, help me understand what I did wrong, because I can't keep doing things this way... First, I cleaned the file of the "strange" lines (there were around 500 of them), and then I saved it as inv.csv.

Then I did the following:

f_inventario = open("inv.csv", "r", encoding="latin1")
lines = f_inventario.readlines()   # read every line into a list of strings
f_inventario.close()

df = pd.DataFrame(lines)           # one-column DataFrame of raw lines

df.columns = ['data']

df['fecha']    = df.data.apply(lambda s : s.split(';')[0])
df['equipo']   = df.data.apply(lambda s : s.split(';')[1])
df['software'] = df.data.apply(lambda s : s.split(';')[2])
df['version']  = df.data.apply(lambda s : s.split(';')[3][:-1])

df.drop(['data'], axis=1, inplace=True)

And now I have my dataframe with all 7M rows. If I do df = pd.read_csv('inv.csv' ... ) it still only reads about 3M records.

I got my problem solved, but this is terrible; it is not how it should be. As far as I can see it is not a memory problem. Could there be some global setting that tells read_csv to load only up to a maximum??? I really don't know.
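
For what it's worth, the same parsing can be done without the four apply calls, e.g. with str.split (just a sketch of an equivalent approach on the cleaned inv.csv, using the same fields and column names as above):

parts = df['data'].str.split(';', expand=True)   # one column per field
df['fecha']    = parts[0]
df['equipo']   = parts[1]
df['software'] = parts[2]
df['version']  = parts[3].str.rstrip('\n')       # same as the [:-1] trailing-newline strip
df.drop(['data'], axis=1, inplace=True)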

Upvotes: 1

Views: 4571

Answers (1)

fourbluecrayons

Reputation: 43

If performance is not an issue, a simple approach is to read the file line by line into a buffer, analyze the data once the buffer is full, and continue iteratively until you have processed the entire file. Once that's done, you can aggregate the results from each chunk to form your final result (a pandas sketch of this chunk-and-aggregate idea follows after the mmap example below). To speed things up, you can look into something like memory mapping, for example:

import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods (returns bytes)
    print(mm.readline())

see this thread
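
Here is the chunk-and-aggregate idea expressed with pandas itself (just a sketch; the chunk size and the counting step are placeholders to adapt to the real analysis):

import pandas as pd

counts = None
# read_csv yields an iterator of DataFrames when chunksize is set
for chunk in pd.read_csv("inventario.csv", delimiter=";", header=None,
                         usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                         chunksize=500000):
    partial = chunk[3].value_counts()   # count programs in this chunk (column 3 = program name)
    counts = partial if counts is None else counts.add(partial, fill_value=0)

print(counts.sort_values(ascending=False).head(20))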

Upvotes: 1
