user3798919

Reputation: 29

Is there a more efficient way to load 10,000 Excel rows into Python?

I am loading data from Excel to populate objects, i.e. Teacher(lastName, firstName, userID). These Teacher objects are then put into a list, where further manipulation and analysis is done (i.e. looking for matches and other patterns). My approach was to use the xlrd module to load the data from Excel.

for row in range(11):
    lastName = sheet1.cell_value(row, 3)
    firstName = sheet1.cell_value(row, 1)
    userID = sheet1.cell_value(row, 0)
    tempProf = Teacher(lastName, firstName, userID)  # construct the object before appending
    profList.append(tempProf)

So this method takes my computer about 20 seconds for just 12 rows, and I need to do this for 10,000. My question: is there another way to load data like this more efficiently? I am open to a completely different approach, so if this question sounds stupid, I would love to hear alternatives. Thanks.

Upvotes: 0

Views: 770

Answers (3)

furas

Reputation: 142651

You could use pandas to load the data directly from Excel into a pandas DataFrame:

import pandas as pd

df = pd.read_excel('filename')  # reads the whole sheet in one call

and then you can use pandas for fast manipulation and analysis.

Lastly, you can save it as Excel, CSV, JSON, HTML, LaTeX, or even write it directly to a database.

So you may not need the "teacher objects" at all.

pandas - Python Data Analysis Library
Pandas Cookbook
Pandas Lessons

Upvotes: 1

Roman K.

Reputation: 164

xlrd provides a row_values(rowx, start_colx, end_colx) method, which should save you two trips back to the file per row.

for row in range(11):
    # end_colx is exclusive, so 0..4 returns all four columns
    userID, firstName, junk, lastName = sheet1.row_values(row, 0, 4)

documentation here: https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966#sheet.Sheet.row_values-method

You want to work with rows rather than individual cells in most cases; taking a slice out of the data set and working with it is more efficient than going back to the data set each time.

10,000 rows is quite a bit, so loading it in will take a while. You may want to consider a format like XML or JSON, whose libraries usually provide efficient search methods, or a flat DB engine like SQLite.

Once you hit more than a thousand or so rows, it's usually a good idea to move to a programmer-friendly format or a database, since those are geared towards efficiency.

*I'm guessing from the code above that there are 4 columns and that the third one is ignored, hence the junk variable.

Upvotes: 0

James Mills

Reputation: 19030

An alternative is to convert your Excel spreadsheets to CSV.

Then use the Python csv module from the standard library.

from csv import reader as csv_reader

with open("data.csv", "r") as fd:
    for row in csv_reader(fd):
        userid, firstname, lastname = row

Explanation:

  1. Import the csv.reader object as csv_reader (my personal convention).
  2. Using Python's built-in with statement (a context manager), open data.csv for reading ("r"), placing the resulting file object in fd.
  3. Now iterate over the rows of data via the reader object that csv_reader(file_object) gives us.
  4. Each row is assumed to be of the form userid, firstname, lastname, so we can unpack the list row into references/variables. See: Tuples and Sequences

Also note: if your data contains Unicode characters, you will want to install and use the unicodecsv third-party library from the Python Package Index (PyPI).

Upvotes: 5
