Reputation: 2819
I have some data in the following format:
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5
56.00 103.70 54.80 103.70 56.000000 103.700000 6
56.00 101.85 53.00 101.85 56.000000 101.850000 7
56.00 102.22 52.10 102.22 56.000000 102.220000 8
56.00 101.11 55.40 101.11 56.000000 101.110000 9
56.00 101.11 54.80 101.11 56.000000 101.110000 10
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
........
What I need is the data for a specific id (the last column).
With numpy I used to do:
d = np.loadtxt('filename')
wanted = d[d[:, 6] == id]
Now I'm learning Pandas and found out that pandas.read_csv()
is really faster than loadtxt().
So logically I was wondering whether the same filtering can be done with pandas (maybe it is even faster).
My first thought was trying groupby
as follows:
p = pd.read_csv('filename', sep=' ', header=None, names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])
d = p.groupby('id')
# [i, g in p.groupby('id') if i == 1]  # syntax error, why? (a comprehension needs 'for': [g for i, g in p.groupby('id') if i == 1])
The question is: Is there a relatively easy way to select from p the rows where, say, id == 1?
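For reference, here is a minimal runnable sketch of the groupby approach, using a small inline sample in place of 'filename' (the sample rows are taken from the data above; the inline string is only a stand-in for the real file):

```python
import io

import pandas as pd

# Inline sample standing in for 'filename' (three rows of the data shown above).
data = """56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 101.85 53.00 101.85 56.000000 101.850000 1"""

p = pd.read_csv(io.StringIO(data), sep=' ', header=None,
                names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])

# The commented-out line fails because a list comprehension needs 'for';
# this is the corrected form: keep only the group whose key is 1.
groups = [g for i, g in p.groupby('id') if i == 1]
print(len(groups))     # one group matches id == 1
print(len(groups[0]))  # that group holds every row with id == 1
```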
EDIT
Trying the proposed solution:
%timeit t_1 = n[ n[:,6]==1 ][:,2:4]
10 loops, best of 3: 60.8 ms per loop
%timeit t_2 = p[ p['id'] == 1 ][['x', 'y']]
10 loops, best of 3: 70.9 ms per loop
It seems that numpy is a bit faster than Pandas here.
That means the fastest way to work in this case is:
1) First read the data with Pandas read_csv
2) Convert the data to a numpy.array
3) and then do the work on the array.
Is this conclusion correct?
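The three-step workflow above can be sketched as follows (again with an inline sample standing in for 'filename'; `to_numpy()` is the modern spelling, older pandas versions use `.values`):

```python
import io

import pandas as pd

data = """56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 101.85 53.00 101.85 56.000000 101.850000 1"""

# 1) read with pandas
p = pd.read_csv(io.StringIO(data), sep=' ', header=None,
                names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])

# 2) convert to a numpy array (all columns are numeric, so this is a plain float array)
n = p.to_numpy()

# 3) do the work with numpy: rows with id == 1, columns x and y
wanted = n[n[:, 6] == 1][:, 2:4]
print(wanted.shape)
```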
Upvotes: 0
Views: 155
Reputation: 139342
You can do just the same as you did with numpy, just now referring to the column by its name:
wanted = d[d['id'] == id]
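For completeness, a runnable sketch of this boolean-mask selection, including the column subset used in the question's timing test (the inline sample is a stand-in for the real file):

```python
import io

import pandas as pd

data = """56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 101.85 53.00 101.85 56.000000 101.850000 1"""

d = pd.read_csv(io.StringIO(data), sep=' ', header=None,
                names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])

# All rows with id == 1, via a boolean mask on the named column.
wanted = d[d['id'] == 1]

# Rows and the x/y columns in one step, as in the %timeit comparison.
xy = d.loc[d['id'] == 1, ['x', 'y']]
print(xy)
```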
Upvotes: 1