Reputation: 7877
Say I have a string containing data from a DB or spreadsheet in comma separated format.
For example:
data = "hello,how,are,you,232.3354,good morning"
Assume that there are maybe 200 fields in these "records".
I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?
The simplest way would be something like:
fields = data.split(",")
result = [fields[4], fields[12], fields[123]]
Is there a faster way to do this that takes advantage of the fact that I only need a few of the fields?
I have tried writing code that makes repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
I am processing several million records so any speedup would be welcome.
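For reference, a minimal sketch of the find-based approach mentioned above (the helper name and field indices are made up for illustration):

def extract_fields(line, wanted):
    # Walk the string comma by comma with str.find, pulling out only the
    # requested 0-based field indices.
    result = []
    start = 0
    field_index = 0
    for target in sorted(wanted):
        while field_index < target:
            start = line.find(',', start) + 1
            field_index += 1
        end = line.find(',', start)
        result.append(line[start:] if end == -1 else line[start:end])
    return result

data = "hello,how,are,you,232.3354,good morning"
print(extract_fields(data, [0, 4, 5]))  # ['hello', '232.3354', 'good morning']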
Upvotes: 6
Views: 3187
Reputation: 13259
You're not going to do much better than loading everything into memory and then dropping the parts that you don't need. My recommendation is compression and a better library.
As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).
> import gzip
> import pandas as pd
> %timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop
Dropping the columns is also pretty fast, though I'm not sure where the major cost lies.
> csv = pd.read_csv(gzip.open('file.csv.gz'))
> %timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
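As a side note, if you only need a handful of columns, read_csv can also skip the rest at parse time via its usecols parameter; a rough sketch, with a placeholder file name and column indices:

import pandas as pd

# usecols keeps only the listed column positions while parsing; header=None
# because the records described have no header row (file name and indices
# are placeholders).
subset = pd.read_csv('file.csv.gz', compression='gzip', header=None,
                     usecols=[4, 12, 123])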
Upvotes: 1
Reputation: 309909
If result can be a tuple instead of a list, you might gain a bit of a speedup (if you're doing multiple calls) using operator.itemgetter:
from operator import itemgetter
indexer = itemgetter(4,12,123)
result = indexer(data.split(','))
You'd need to timeit it to actually see whether you get a speedup, though.
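A rough comparison might look like this (the 200-field sample line and the indices are made up for illustration):

import timeit
from operator import itemgetter

# Made-up 200-field record, mirroring the question's setup.
data = ",".join(str(i) for i in range(200))
indexer = itemgetter(4, 12, 123)

def via_indexing():
    fields = data.split(',')
    return [fields[4], fields[12], fields[123]]

def via_itemgetter():
    return indexer(data.split(','))

print(timeit.timeit(via_indexing, number=100000))
print(timeit.timeit(via_itemgetter, number=100000))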
Upvotes: 0