Reputation: 7877
Say I have a string containing data from a DB or spreadsheet in comma separated format.
For example:
data = "hello,how,are,you,232.3354,good morning"
Assume that there are maybe 200 fields in these "records".
I am interested in looking at just certain fields of this record. What is the fastest way in Python to get at them?
The simplest way would be something like:
fields = data.split(",")
result = [fields[4], fields[12], fields[123]]
Is there a faster way to do this that takes advantage of the fact that I only need a few of the fields?
I have tried writing code that makes repeated calls to find to skip past commas, but if the last field is too far down the string this becomes slower than the basic split solution.
I am processing several million records so any speedup would be welcome.
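For reference, a minimal sketch of the find-based approach mentioned above (the helper name and field indices are made up for illustration):

def extract_fields(line, wanted):
    # Walk the string comma by comma with str.find, pulling out only the
    # requested 0-based field indices.
    result = []
    start = 0
    field_index = 0
    for target in sorted(wanted):
        while field_index < target:
            start = line.find(',', start) + 1
            field_index += 1
        end = line.find(',', start)
        result.append(line[start:] if end == -1 else line[start:end])
    return result

data = "hello,how,are,you,232.3354,good morning"
print(extract_fields(data, [0, 4, 5]))  # ['hello', '232.3354', 'good morning']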
Upvotes: 6
Views: 3187
Reputation: 13259
You're not going to do much better than loading everything into memory and then dropping the parts that you don't need. My recommendation is compression and a better library.
As it happens, I have a couple of reasonably sized CSVs lying around (this one is 500k lines).
> import gzip
> import pandas as pd
> %timeit pd.read_csv(gzip.open('file.csv.gz'))
1 loops, best of 3: 545 ms per loop
Dropping the columns is also pretty fast, though I'm not sure where the major cost lies.
> csv = pd.read_csv(gzip.open('file.csv.gz'))
> %timeit csv[['col1', 'col2']]
100 loops, best of 3: 5.5 ms per loop
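As a side note, if you only need a handful of columns, read_csv can also skip the rest at parse time via its usecols parameter; a rough sketch, with a placeholder file name and column indices:

import pandas as pd

# usecols keeps only the listed column positions while parsing; header=None
# because the records described have no header row (file name and indices
# are placeholders).
subset = pd.read_csv('file.csv.gz', compression='gzip', header=None,
                     usecols=[4, 12, 123])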
Upvotes: 1
Reputation: 309909
If result can be a tuple instead of a list, you might gain a bit of a speedup (if you're doing multiple calls) using operator.itemgetter:
from operator import itemgetter
indexer = itemgetter(4,12,123)
result = indexer(data.split(','))
You'd need to timeit it to actually see whether you get a speedup, though.
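A rough comparison might look like this (the 200-field sample line and the indices are made up for illustration):

import timeit
from operator import itemgetter

# Made-up 200-field record, mirroring the question's setup.
data = ",".join(str(i) for i in range(200))
indexer = itemgetter(4, 12, 123)

def via_indexing():
    fields = data.split(',')
    return [fields[4], fields[12], fields[123]]

def via_itemgetter():
    return indexer(data.split(','))

print(timeit.timeit(via_indexing, number=100000))
print(timeit.timeit(via_itemgetter, number=100000))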
Upvotes: 0