Dave31415

Reputation: 2956

Best data structure: arrays of dictionaries, arrays of objects?

I am moving to Python and NumPy from IDL (which is similar to Matlab). This is a fairly open-ended question about handling data. Maybe someone can help.

The usual situation is that my data has a fixed set of fields, perhaps from a spreadsheet, database, etc. I am trying to figure out which data structures are best to use in Python and NumPy.

I know about the csv module and can use csv.DictReader() to read a spreadsheet. This reads line by line and makes a dictionary with the proper names from the spreadsheet header (first line).

import csv

f=open(file,'rU')
dat = csv.DictReader(f)
data=[] # makes an empty list
i=0
for row in dat:
    data.append(row)
    if i == 0 :
        # grab the column names from the first row only
        keys=row.keys()
        print "keys"
        print keys
        print
    i=i+1

f.close()

First of all, that is quite a lot of code just to read a csv file into a list of dictionaries and get the keys. Is there a faster/better way?

But now, I wonder whether an array of dictionaries is really what I want. Should I make a class of objects and make this an array of objects? Or something else?

If I have my array of dictionaries, "data", I would get some "column" like age=array([dat["age"] for dat in data])

Is that the right way to do it? Is there no way like "age=data->age" that would do it faster?

Would appreciate some guidance. Thanks.

Upvotes: 2

Views: 1531

Answers (4)

Joe Kington

Reputation: 284672

Seeing as how you explicitly mention using numpy, consider something like the following:

import numpy as np
data = np.genfromtxt('data.txt', delimiter=',', names=True)
print data['item1']

Or

import numpy as np
item1, item2, item3 = np.loadtxt('data.txt', delimiter=',', skiprows=1).T

Where the format of data.txt is something along these lines (i.e. comma delimited).

item1, item2, item3
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0

The first example returns a structured array whose fields are named after the header row, while the second skips the header and just unpacks the columns (thus the transpose (.T)) into three variables.
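To make the structured-array option a bit more concrete, here is a minimal sketch written for Python 3. It reads the sample above from an in-memory string rather than from data.txt (that substitution is only there so the snippet runs on its own) and shows that the same array can be indexed by column name or by row number.

```python
import io

import numpy as np

# Assumption: same layout as the data.txt sample above, held in memory
# so the example runs without a file on disk.
text = """item1, item2, item3
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
"""

# names=True takes the field names from the header line.
data = np.genfromtxt(io.StringIO(text), delimiter=',', names=True)

column = data['item1']   # one field across all rows, as a float array
first_row = data[0]      # one record holding all three fields
total = column.sum()     # ordinary NumPy operations work on a column

print(data.dtype.names)
print(column)
print(first_row['item2'])
```

Pulling out a column this way needs no list comprehension, which is roughly the "age=data->age" style of access the question asks about.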

Upvotes: 2

Thomas K

Reputation: 40340

If you're working with spreadsheet-type data a lot, I'd strongly recommend using pandas, a Python package designed for this sort of thing. You just do:

pandas.read_csv(file)

That gives you a DataFrame, which lets you select columns by name and rows by label or condition, and is nice and fast.
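As a rough illustration of what that looks like for the "age" column from the question, here is a small Python 3 sketch. The column names and values are made up, and the data is read from an in-memory string instead of a file so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical spreadsheet contents; column names are invented for illustration.
text = """name,age
Ann,31
Bob,45
Cid,27
"""

df = pd.read_csv(io.StringIO(text))

ages = df['age']              # whole column by name, no explicit loop
older = df[df['age'] > 30]    # rows selected with a boolean mask
mean_age = ages.mean()

print(list(df.columns))
print(ages.values)
print(older['name'].tolist())
```

With a real file, pd.read_csv takes the path directly, as in the one-liner above.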

Upvotes: 5

John Zwinck

Reputation: 249303

Doing it the way you are is OK, though your code can easily be made more concise:

data = list(csv.DictReader(open(file, 'rU')))
print "keys", data[0].keys()
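For anyone on Python 3, a sketch of the same idea that also closes the file and pulls out one column might look like this. The CSV contents are hypothetical, and io.StringIO stands in for a real file so the snippet runs as-is.

```python
import csv
import io

# Hypothetical CSV contents standing in for the file; with a real file,
# use "with open(path, newline='') as f:" instead of StringIO.
text = "name,age\nAnn,31\nBob,45\n"

with io.StringIO(text) as f:
    reader = csv.DictReader(f)
    data = list(reader)          # one dict per row
    keys = reader.fieldnames     # header names, in file order

# DictReader yields strings, so convert when pulling out a numeric column.
ages = [int(row['age']) for row in data]

print(keys)
print(ages)
```

Taking the keys from reader.fieldnames avoids having to look at the first row, and it still works when the file has a header but no data rows.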

Upvotes: 0

Michael

Reputation: 3628

I always go with arrays of objects.
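The answer does not say what kind of objects it has in mind. One possible reading, sketched here in Python 3 under that assumption, is a list of lightweight records built with collections.namedtuple. The Person type and the sample data are hypothetical.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type and data, invented for illustration.
Person = namedtuple('Person', ['name', 'age'])

text = "name,age\nAnn,31\nBob,45\n"

reader = csv.reader(io.StringIO(text))
next(reader)  # skip the header row
people = [Person(name, int(age)) for name, age in reader]

# Fields are read as attributes; a column is still a comprehension.
ages = [p.age for p in people]

print(people[0].name)
print(ages)
```

This gives attribute access (p.age rather than row["age"]), but extracting a whole column still takes a loop over the list, unlike a structured array or a DataFrame.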

Upvotes: 0
