MW2015

Reputation: 11

Fastest way to extract part of a long string in Python

I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a sub string like this:

my_token:[
  "key_of_interest"
],

This is the only part in each string where it says my_token. I was thinking about finding the end index position of ' my_token:[" ', then finding the beginning index position of ' "], ', and getting all the text between those two index positions.

Is there a better or more efficient way of doing this? I'll be doing this for strings of length ~10,000, in sets of 100,000 strings.
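In code, the approach I had in mind would look roughly like this (the sample string here is made up to match the snippet above):

```python
def extract_key(s):
    # Find my_token, then take the text between the next pair of
    # double quotes. Assumes exactly one my_token per string and a
    # single quoted key, as in the snippet above.
    start = s.index('my_token:[')
    open_quote = s.index('"', start)
    close_quote = s.index('"', open_quote + 1)
    return s[open_quote + 1:close_quote]

text = 'header my_token:[\n  "key_of_interest"\n],\nfooter'
print(extract_key(text))  # key_of_interest
```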

Edit: The file is a .ion file. From my understanding it can be treated as a flat file, as it is text-based and used for describing metadata.

Upvotes: 1

Views: 2860

Answers (3)

bignose

Reputation: 32279

The underlying requirement shows through when you clarify:

I was thinking about finding the end index position of ' my_token:[" ', then finding the beginning index position of ' "], ', and getting all the text between those two index positions.

That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.

There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.

So, use libraries written by people who have dealt with the issues before you.

  • If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
  • If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
  • If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.

So to make a good choice you need to know the data format (this is not answered by the file names; rather, you need to know the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
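For instance, if the content turned out to be plain JSON, the whole "extract a substring" problem would reduce to a parse plus a lookup (the document shape here is an assumption for illustration only):

```python
import json

# Hypothetical JSON document with the same shape as the snippet in
# the question: parse it with the standard library, then index in.
doc = '{"my_token": ["key_of_interest"], "other": 123}'
data = json.loads(doc)
print(data["my_token"][0])  # key_of_interest
```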

Upvotes: 1

AbdealiLoKo

Reputation: 3317

Well, as already mentioned - a parser seems the best option.

But to answer your question without all the extra advice: if you're just looking at speed, a parser isn't really the fastest way to do this. The faster method, given you already have a string like this, would be to use a regex:

import re

# use re.search, not re.match, since my_token is not at the start of the string
matches = re.search(r'my_token:\[\s*"(.*)"\s*\],', s)
key_of_interest = matches.group(1)

There are other issues that come up, though. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. So this gets a bit too complicated.

And JSON is not parsable by regex in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions, a regex would be faster than a JSON parser.
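To illustrate the escaping problem with a made-up key: the regex hands back the raw escape sequences verbatim, where a JSON parser would decode them to plain quotes:

```python
import re

pattern = re.compile(r'my_token:\[\s*"(.*)"\s*\],')

# A made-up key containing escaped quotes: the regex captures the
# backslashes as-is instead of decoding the escape sequences.
s = 'my_token:[\n  "say \\"hi\\""\n],'
m = pattern.search(s)
print(m.group(1))  # say \"hi\"
```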

Upvotes: 0

ivan_pozdeev

Reputation: 35998

How can this possibly be done the "dumbest and simplest way"?

  • find the starting position
  • look on for the ending position
  • grab everything indiscriminately between the two

This is indeed what you're doing. Thus any further improvement can only come from optimizing each step. Possible ways include:

  • narrow down the search region (requires additional constraints/assumptions as per comment56995056)
  • speed up the search operation bits, which include:
    • extracting raw data from the format
      • you already did this by disregarding the format altogether - so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token) as per comment56995034
    • elementary pattern comparison operation
      • unlikely to improve in pure Python, since str.index is already implemented in C and its implementation is probably as simple as it can possibly be
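The three steps above, written out with str.index (the marker strings are assumptions, and this sketch ignores the whitespace seen in the sample snippet):

```python
def between(s, start_marker='my_token:["', end_marker='"],'):
    # 1. find the starting position
    start = s.index(start_marker) + len(start_marker)
    # 2. look on for the ending position; the third argument of
    #    str.index narrows the search region to what follows
    end = s.index(end_marker, start)
    # 3. grab everything indiscriminately between the two
    return s[start:end]

print(between('prefix my_token:["key_of_interest"], suffix'))  # key_of_interest
```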

Upvotes: 1
