Reputation: 3891
I am trying to remove all instances of the following strings from a file:
{ "userID":(some 6 digit number), "array":[]},
In particular, I'd like to find all such substrings and replace them with nothing ('').
I started by using re.match to make sure my expression was correct:
matchObj = re.match( r'({.*?"array":\[\]\},?)', g)
This works fine and returns what I want (I put the question marks in twice to turn off the greedy default for re). But when I then move to re.sub it matches many portions of the string I wasn't expecting it to match. In particular this expression:
matchObj = re.match( r'({.*?"array":\[\]\},?)', g)
ggg = re.sub( r'({.*?"array":\[\]\},?)', '', g)
with this value for g:
g = 'fedsgedsgs {"all": [{"userID": 777, "array":[]},azgagaga{"userID": 777, "array":[{"expand":"abs","id":503711372,"sport":18,"start_time":"2015-04-15T16:11:12.000Z","local_start_time":"2015-04-15T17:11:12.000Z","distance":4.281959056854248,"duration":2\
891.0,"speed_avg":5.332083225415182,"speed_max":6.74372,"altitude_min":27.0,"altitude_max":61.0,"ascent":80.0,"descent":86.0},{"expand":"abs","id":470811412,"sport":18,"start_time":"2015-02-11T09:27:10.000Z","local_start_time":"2015-02-\
11T10:27:10.000Z","distance":0.0,"duration":0.0},{"expand":"abs","id":470755226,"sport":18,"start_time":"2015-02-11T09:25:04.000Z","local_start_time":"2015-02-11T10:25:04.000Z","distance":0.0,"duration":0.0,"speed_max":0.0,"altitude_min\
":45.0,"altitude_max":45.0},{"expand":"abs","id":470749841,"sport":18,"start_time":"2015-02-11T09:10:43.000Z","local_start_time":"2015-02-11T10:10:43.000Z","distance":0.7858999967575073,"duration":479.0,"speed_avg":5.90655529922135,"spe\
ed_max":6.82629,"altitude_min":35.0,"altitude_max":57.0,"ascent":45.0,"descent":32.0}]},{"userID": 777, "array":[{"expand":"abs","id":470745921,"sport":0,"start_time":"2015-02-11T09:00:48.000Z","local_start_time":"2015-02-11T15:00:48.00\
0Z","distance":0.0,"duration":15.0,"speed_avg":0.0}]},{"userID": 777, "array":[{"expand":"abs","id":498050248,"sport":2,"start_time":"2015-04-06T14:00:03.000Z","local_start_time":"2015-04-06T19:00:03.000Z","distance":16.55500030517578,"\
duration":2793.51,"speed_avg":21.334450601083514,"speed_max":36.3397,"altitude_min":1.8,"altitude_max":35.5,"ascent":50.7,"descent":61.8},{"expand":"abs","id":498049916,"sport":2,"start_time":"2015-04-06T13:59:35.000Z","local_start_time\
":"2015-04-06T18:59:35.000Z","distance":0.010999999940395355,"duration":10.2,"speed_avg":3.882352920139537,"speed_max":2.072,"altitude_min":8.4,"altitude_max":8.4,"ascent":0.0,"descent":0.0},{"expand":"abs","id":486139822,"sport":2,"sta\
rt_time":"2015-03-15T00:21:08.000Z","local_start_time":"2015-03-15T06:21:08.000Z","distance":23.302000045776367,"duration":3997.54,"speed_avg":20.984705635164357,"speed_max":38.4344,"altitude_min":-7.3,"altitude_max":14.6,"ascent":20.1,\
"descent":42.1},{"expand":"abs","id":486139782,"sport":2,"start_time":"2015-03-15T00:20:50.000Z","local_start_time":"2015-03-15T06:20:50.000Z","distance":0.0,"duration":2.99,"speed_avg":0.0,"speed_max":0.0,"altitude_min":4.8,"altitude_m\
ax":4.8,"ascent":0.0,"descent":0 {"userID": 777, "array":[]}, mmmmmmmm {"userID": 7767, "array":[]}, gggggggg {"userID": 74577, "array":[]}, ggggggggggggggg {"userID": 774447, "array":[]}, hrdshe {"userID": 722277, "array":[]},'
leads to this output for ggg:
In[37]: ggg
Out[37]: 'fedsgedsgs azgagaga mmmmmmmm gggggggg ggggggggggggggg hrdshe '
The expression is replacing expressions of this form with '':
{ "userID":(some 6 digit number), "array":[lots of json objects printed here.....]},
whereas I want to leave these expressions (those with nonempty arrays) intact.
I tried removing the backslash escapes from \[\] because I only want to match "[]", but then I get an error message that I have an incomplete expression. Why am I matching [....stuff....] with junk inside, and how can I match just "[]"?
UPDATE
so this is working:
ggg = re.sub( r'"userID": [0-9]{6,6}, "array":\[\]\},', 'FOUND IT', g)
Somehow greediness doesn't seem to be an issue. If anyone can explain to me why the above works but not the original try, I would be really interested to know.
Upvotes: 2
Views: 559
Reputation: 295344
re.match() is implicitly anchored. That is to say:
re.match('foo', content) # find foo only at the beginning of content
...is the same as...
re.match('^foo', content) # find foo only at the beginning of content
...whereas:
re.sub('foo', 'bar', content) # replace foo with bar everywhere in content
...is implicitly unanchored, making it behave the same as
re.search('foo', content) # find foo anywhere in content
...which will find instances of foo everywhere in content, not only at the beginning.
Thus, to make the regexes you use with re.sub() behave the same way as they would with re.match(), add an explicit ^ anchor.
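To make the difference concrete, here is a small sketch. The sample string is made up for illustration; only the standard re functions are used.

```python
import re

# Made-up sample: "foo" appears twice, but never at position 0.
content = "xx foo yy foo"

# re.match only tries at the start of the string, so it finds nothing here.
match_result = re.match(r"foo", content)

# re.search scans forward and reports the first occurrence anywhere.
search_start = re.search(r"foo", content).start()

# re.sub replaces every occurrence, wherever it begins.
replaced_all = re.sub(r"foo", "bar", content)

# With an explicit ^ anchor, re.sub only considers the start of the string,
# so nothing is replaced in this sample.
replaced_anchored = re.sub(r"^foo", "bar", content)

print(match_result, search_start, replaced_all, replaced_anchored)
```

The same thing is happening in the question: the pattern that looked right under re.match() was only ever tested at position 0, while re.sub() tries it at every position in g.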
(BTW -- attempts to modify JSON this way are doomed to end in pain and suffering. Parse, update and reserialize -- otherwise you're opening yourself up to a wide range of bugs unnecessarily).
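A minimal sketch of the parse/update/reserialize route, assuming the real data is well-formed JSON shaped like {"all": [ ...entries... ]}. The raw string below is a hypothetical stand-in, not the g from the question (which is not valid JSON as posted).

```python
import json

# Hypothetical, well-formed stand-in for the data in the question.
raw = (
    '{"all": ['
    '{"userID": 777, "array": []}, '
    '{"userID": 778, "array": [{"id": 503711372, "sport": 18}]}'
    ']}'
)

data = json.loads(raw)

# Keep only the entries whose "array" is non-empty; an empty list is falsy.
data["all"] = [entry for entry in data["all"] if entry["array"]]

cleaned = json.dumps(data)
print(cleaned)
```

The filter works on the parsed structure, so it does not care how the braces, commas, or whitespace happened to be laid out in the text.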
Upvotes: 3
Reputation: 11536
I think you misunderstood greedy vs. non-greedy matching. Being non-greedy doesn't prevent your regex from matching from a { to a far, far away "array":[]},.
Non-greedy just means the match stops at the closest "array":[]}, after the point where it started.
You may replace the . in your .*? with [^}] (giving [^}]*?) so that the match clearly cannot "get out" of the {} pair.
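A short sketch of that difference, using a shortened, made-up stand-in for g. The variable names are only for illustration.

```python
import re

# Shortened stand-in: one entry with a non-empty array, one with an empty one.
g = 'a {"userID": 1, "array":[{"id": 5}]}, b {"userID": 2, "array":[]}, c'

# Lazy dot: starts at the FIRST "{" and grows until "array":[]}, is reachable,
# so it swallows the non-empty entry as well.
lazy = r'\{.*?"array":\[\]\},?'

# Negated class: the match cannot cross a "}", so it has to start at the
# "{" of the entry that really has an empty array.
bounded = r'\{[^}]*?"array":\[\]\},?'

lazy_result = re.sub(lazy, '', g)
bounded_result = re.sub(bounded, '', g)

print(repr(lazy_result))
print(repr(bounded_result))
```

Note that [^}] still allows the match to start at an outer { such as the one in {"all": [ and run inward; using [^{}] instead would be one way to rule that out, if that matters for your data.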
But why not load this using json.loads, clean it, and write it back out using json.dumps? What if there are spaces or newlines (still valid JSON) around your : or }?
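To illustrate the whitespace point, here is a small sketch with two made-up strings that are the same JSON object written two ways. The strict pattern is only an example of a regex that hard-codes one layout.

```python
import json
import re

# The same object, once compact and once with extra spaces and a newline.
compact = '{"userID": 777, "array":[]}'
spaced = '{ "userID" : 777,\n  "array" : [ ] }'

# A pattern that hard-codes the compact layout.
strict = r'\{[^}]*"array":\[\]\}'

strict_hits_compact = re.search(strict, compact) is not None
strict_hits_spaced = re.search(strict, spaced) is not None

# The JSON parser treats both spellings as the same value.
same_value = json.loads(compact) == json.loads(spaced)

print(strict_hits_compact, strict_hits_spaced, same_value)
```

The regex only recognises the first spelling, while json.loads gives identical results for both.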
Upvotes: 0