Reputation: 47
Consider the following: "MULTILINESTRING((10 10,10 40),(40 40,30 30,40 20,30 10))".
I want to transform this into: [[10,10],[10,40],[40,40],[30,30],[40,20],[30,10]].
My solution
I use the functions split() and replace() to format this. The result is messy code that is probably not the most efficient, like my_str.split('((')[1].split('))')[1]...etc
Because I'm doing this on a huge dataset, I'm looking for an efficient way to do it.
Upvotes: 2
Views: 583
Reputation: 402393
If you're looking for clean code that doesn't do too much, I'd recommend a two-step process involving the re module: str.split on the commas, then re.findall on each piece.
For performance, I'd recommend pre-compiling the regex pattern using re.compile, since we'll be calling it repeatedly inside a loop.
>>> import re
>>> p = re.compile(r'\d+(?:\.\d+)?')
>>> [list(map(int, p.findall(x))) for x in mstring.split(',')]
[[10, 10], [10, 40], [40, 40], [30, 30], [40, 20], [30, 10]]
Note that mstring is your string data.
Details
\d+ # match one or more digits
(?: # specify non-capturing group
\. # literal period/decimal
\d+
)? # optional
Semantically, this regex will match integers OR floats. Ajax1234's solution currently accounts only for integers, and so should finish searching in fewer cycles.
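One caveat worth sketching: the pattern matches decimals, but int() raises ValueError on a string such as "10.5", so float data needs a float conversion. The helper name parse_coords and the sample string below are illustrative assumptions, not part of the original answer, and the sketch assumes non-negative coordinates without signs or exponents.

```python
import re

# Same pattern as above: an integer or a decimal number, compiled once.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def parse_coords(wkt):
    """Hypothetical helper: return [[x, y], ...] as floats."""
    # Each comma-separated chunk holds one "x y" pair.
    return [[float(n) for n in NUMBER.findall(chunk)]
            for chunk in wkt.split(',')]

# Made-up sample input containing decimal coordinates.
sample = "MULTILINESTRING((10.5 10,10 40.25),(40 40,30 30))"
coords = parse_coords(sample)
print(coords)
```

If the data is known to be integer-only, swapping float for int keeps the output identical to the one shown in the answer.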
Upvotes: 2
Reputation: 71451
You can use re:
import re
s = 'MULTILINESTRING((10 10,10 40),(40 40,30 30,40 20,30 10))'
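# Raw string avoids an invalid-escape warning; filter(None, ...) drops empty lists from whitespace-only matches
final_result = list(filter(None, [list(map(int, i.split())) for i in re.findall(r'[\d\s]+', s)]))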
Output:
[[10, 10], [10, 40], [40, 40], [30, 30], [40, 20], [30, 10]]
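For a large dataset, the same idea can be wrapped in a function with the pattern compiled once. This is only a sketch: to_points, the rows list, and the second sample string are assumptions made up for illustration, and it handles integer coordinates only.

```python
import re

# Runs of digits and whitespace, compiled once and reused for every row.
PAIRS = re.compile(r'[\d\s]+')

def to_points(wkt):
    """Hypothetical helper: return [[x, y], ...] as ints."""
    pairs = (list(map(int, m.split())) for m in PAIRS.findall(wkt))
    # Whitespace-only matches split into empty lists; drop them.
    return [p for p in pairs if p]

# Made-up rows standing in for the real dataset.
rows = [
    'MULTILINESTRING((10 10,10 40),(40 40,30 30,40 20,30 10))',
    'LINESTRING (30 10, 10 30, 40 40)',
]
results = [to_points(r) for r in rows]
print(results)
```

The second row shows why the empty-list filter matters: the space before the opening parenthesis matches the pattern on its own and would otherwise produce an empty entry.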
Upvotes: 2