luisfer

Reputation: 2120

Python/Pandas: How can I read 7 million records?

I was given a "database" (more accurately, an ugly, huge CSV file) that contains the results of a "discovery" process. The rows are very short; they hold licensing information for over 65,000 computers and look like this:

10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Publisher MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Outlook MUI (Spanish) 2010;14.0.7015.1000;20160216;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Groove MUI (Spanish) 2010;14.0.7015.1000;20150722;Microsoft Corporation
10/02/2017 09:14:56 a.m.;0000GATMEX39388; ;Microsoft Office Word MUI (Spanish) 2010;14.0.7015.1000;20151119;Microsoft Corporation

As you can see, it is a semicolon-separated file. Each row has the time the process was run, the PC's ID, a blank field (I don't know what it is), the program name, and the program version. There are more fields, but I don't care about them; only those are relevant.

So I turned to Pandas to do some analysis (basically counting), and got around 3M records. The problem is that the file has over 7M records (I checked it with Notepad++ 64-bit). So, how can I use Pandas to analyze a file with so many records?
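
The kind of counting I mean is nothing fancy, roughly something like this (just a sketch; the column names are placeholders for the fields described above):

import pandas as pd

df = pd.read_csv("inventario.csv", delimiter=";", header=None,
                 usecols=[0, 1, 2, 3, 4], encoding="latin_1")
df.columns = ["fecha", "equipo", "blank", "software", "version"]

print(df["software"].value_counts().head(20))   # how many rows each program appears in
print(df.groupby("software")["equipo"].nunique().sort_values(ascending=False).head(20))   # distinct machines per program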

I'm using Python 3.5, Pandas 0.19.2

Adding info for Fabio's comment:

I'm using:

df = pd.read_csv("inventario.csv", delimiter=";", 
                 header=None, usecols=[0,1,2,3,4], 
                 encoding="latin_1")

To be very precise: the file has 7,432,175 rows, but Pandas only reads 3,172,197 of them. Curiously, if I load the file into Excel 2017 (using a data query) it loads exactly the same 3,172,197 rows.
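
(For the record, the raw line count can also be confirmed from Python; a sketch that counts physical lines without parsing them and compares against the dataframe loaded above:)

with open("inventario.csv", encoding="latin_1") as f:
    raw_lines = sum(1 for _ in f)   # physical lines in the file
print(raw_lines, len(df))           # ~7,432,175 raw lines vs. the ~3,172,197 rows read_csv returns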

EDIT: After the comments, I checked the file and found that some lines (around 450) were corrupted. I don't know if they were meant to signal an end of file; it doesn't look like it. In any case, I cleaned out the malformed lines, and Pandas still reads only around 3M lines.

EDIT:

OK, I solved the problem, but really, help me understand what I did wrong, because I can't keep doing things this way... First, I cleaned the file of the "strange" lines (there were around 500 of them), and then I saved it as inv.csv.

Then I did the following:

f_inventario = open("inv.csv", "r", encoding="latin1")
lines = f_inventario.readlines()   # read every line into a list of strings
f_inventario.close()

df = pd.DataFrame(lines)           # one-column DataFrame of raw lines

df.columns = ['data']

df['fecha']    = df.data.apply(lambda s : s.split(';')[0])
df['equipo']   = df.data.apply(lambda s : s.split(';')[1])
df['software'] = df.data.apply(lambda s : s.split(';')[2])
df['version']  = df.data.apply(lambda s : s.split(';')[3][:-1])

df.drop(['data'], axis=1, inplace=True)

And now I have my dataframe with all 7M rows. If I do df = pd.read_csv('inv.csv' ... ) it still only reads about 3M records.

I got my problem solved, but this is terrible; it is not how it should be. As far as I can see it is not a memory problem. Could there be some global setting that tells read_csv to load only up to a maximum??? I really don't know.
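
For what it's worth, the same parsing can be done without the four apply calls, e.g. with str.split (just a sketch of an equivalent approach on the cleaned inv.csv, using the same fields and column names as above):

parts = df['data'].str.split(';', expand=True)   # one column per field
df['fecha']    = parts[0]
df['equipo']   = parts[1]
df['software'] = parts[2]
df['version']  = parts[3].str.rstrip('\n')       # same as the [:-1] trailing-newline strip
df.drop(['data'], axis=1, inplace=True)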

Upvotes: 1

Views: 4571

Answers (1)

fourbluecrayons

Reputation: 43

If performance is not an issue, a simple approach is to read the file line by line into a buffer, analyze the data once the buffer is full, and continue iteratively until you have processed the entire file. Once that's done, you can aggregate the results from each chunk to form your final result (a pandas sketch of this chunk-and-aggregate idea follows after the mmap example below). To speed things up, you can look into something like memory mapping, for example:

import mmap
with open("hello.txt", "r+") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods (returns bytes)
    print(mm.readline())

see this thread
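
Here is the chunk-and-aggregate idea expressed with pandas itself (just a sketch; the chunk size and the counting step are placeholders to adapt to the real analysis):

import pandas as pd

counts = None
# read_csv yields an iterator of DataFrames when chunksize is set
for chunk in pd.read_csv("inventario.csv", delimiter=";", header=None,
                         usecols=[0, 1, 2, 3, 4], encoding="latin_1",
                         chunksize=500000):
    partial = chunk[3].value_counts()   # count programs in this chunk (column 3 = program name)
    counts = partial if counts is None else counts.add(partial, fill_value=0)

print(counts.sort_values(ascending=False).head(20))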

Upvotes: 1
