sunny

Reputation: 3891

Python re.sub and re.match don't match?

I am trying to remove all instances of the following strings from a file:

{ "userID":(some 6 digit number), "array":[]},

In particular, I'd like to find all such substrings and replace them with nothing ('').

I started by using re.match to make sure my expression was correct:

matchObj = re.match( r'({.*?"array":\[\]\},?)', g)

This works fine and returns what I want (I put the question marks in twice to turn off the greedy default for re). But when I then move to re.sub it matches many portions of the string I wasn't expecting it to match. In particular this expression:

matchObj = re.match( r'({.*?"array":\[\]\},?)', g)
ggg =  re.sub( r'({.*?"array":\[\]\},?)', '', g)

with this value for g:

g = 'fedsgedsgs {"all": [{"userID": 777, "array":[]},azgagaga{"userID": 777, "array":[{"expand":"abs","id":503711372,"sport":18,"start_time":"2015-04-15T16:11:12.000Z","local_start_time":"2015-04-15T17:11:12.000Z","distance":4.281959056854248,"duration":2\
891.0,"speed_avg":5.332083225415182,"speed_max":6.74372,"altitude_min":27.0,"altitude_max":61.0,"ascent":80.0,"descent":86.0},{"expand":"abs","id":470811412,"sport":18,"start_time":"2015-02-11T09:27:10.000Z","local_start_time":"2015-02-\
11T10:27:10.000Z","distance":0.0,"duration":0.0},{"expand":"abs","id":470755226,"sport":18,"start_time":"2015-02-11T09:25:04.000Z","local_start_time":"2015-02-11T10:25:04.000Z","distance":0.0,"duration":0.0,"speed_max":0.0,"altitude_min\
":45.0,"altitude_max":45.0},{"expand":"abs","id":470749841,"sport":18,"start_time":"2015-02-11T09:10:43.000Z","local_start_time":"2015-02-11T10:10:43.000Z","distance":0.7858999967575073,"duration":479.0,"speed_avg":5.90655529922135,"spe\
ed_max":6.82629,"altitude_min":35.0,"altitude_max":57.0,"ascent":45.0,"descent":32.0}]},{"userID": 777, "array":[{"expand":"abs","id":470745921,"sport":0,"start_time":"2015-02-11T09:00:48.000Z","local_start_time":"2015-02-11T15:00:48.00\
0Z","distance":0.0,"duration":15.0,"speed_avg":0.0}]},{"userID": 777, "array":[{"expand":"abs","id":498050248,"sport":2,"start_time":"2015-04-06T14:00:03.000Z","local_start_time":"2015-04-06T19:00:03.000Z","distance":16.55500030517578,"\
duration":2793.51,"speed_avg":21.334450601083514,"speed_max":36.3397,"altitude_min":1.8,"altitude_max":35.5,"ascent":50.7,"descent":61.8},{"expand":"abs","id":498049916,"sport":2,"start_time":"2015-04-06T13:59:35.000Z","local_start_time\
":"2015-04-06T18:59:35.000Z","distance":0.010999999940395355,"duration":10.2,"speed_avg":3.882352920139537,"speed_max":2.072,"altitude_min":8.4,"altitude_max":8.4,"ascent":0.0,"descent":0.0},{"expand":"abs","id":486139822,"sport":2,"sta\
rt_time":"2015-03-15T00:21:08.000Z","local_start_time":"2015-03-15T06:21:08.000Z","distance":23.302000045776367,"duration":3997.54,"speed_avg":20.984705635164357,"speed_max":38.4344,"altitude_min":-7.3,"altitude_max":14.6,"ascent":20.1,\
"descent":42.1},{"expand":"abs","id":486139782,"sport":2,"start_time":"2015-03-15T00:20:50.000Z","local_start_time":"2015-03-15T06:20:50.000Z","distance":0.0,"duration":2.99,"speed_avg":0.0,"speed_max":0.0,"altitude_min":4.8,"altitude_m\
ax":4.8,"ascent":0.0,"descent":0 {"userID": 777, "array":[]}, mmmmmmmm {"userID": 7767, "array":[]}, gggggggg {"userID": 74577, "array":[]}, ggggggggggggggg {"userID": 774447, "array":[]}, hrdshe {"userID": 722277, "array":[]},'

leads to this output for ggg:

In[37]:   ggg
Out[37]: 'fedsgedsgs azgagaga mmmmmmmm  gggggggg  ggggggggggggggg  hrdshe '

The expression is replacing expressions of this form with '':

  { "userID":(some 6 digit number), "array":[lots of json objects printed here.....]},

whereas I want to leave these expressions (those with nonempty arrays) intact.

I tried removing the escape keys from \[\] because I only want to match "[]" but then I get an error message that I have an incomplete expression. Why am I matching [....stuff....] with junk inside and how can I match just "[]"?

UPDATE

So this is working:

ggg = re.sub( r'"userID": [0-9]{6,6}, "array":\[\]},', 'FOUND IT', g)

Somehow greediness doesn't seem to be an issue. If anyone can explain to me why the above works but not the original try, I would be really interested to know.

Upvotes: 2

Views: 559

Answers (2)

Charles Duffy

Reputation: 295344

re.match() is implicitly anchored. That is to say:

re.match('foo', content)   # find foo only at the beginning of content

...is the same as...

re.match('^foo', content)  # find foo only at the beginning of content

...whereas:

re.sub('foo', 'bar', content) # replace foo with bar everywhere in content

...is implicitly unanchored, so it looks for matches the same way as

re.search('foo', content) # find foo anywhere in content

...which will match foo at any position in content, not only at the beginning.


Thus, to make a regex used with re.sub() behave the same way it does with re.match(), add an explicit ^ anchor.
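A minimal sketch of the difference, using a made-up content string (not the asker's data) so the results are easy to check by eye:

```python
import re

# Hypothetical sample text: "foo" appears twice, never at position 0.
content = "xx foo yy foo"

# re.match only tries at the start of the string, so nothing is found.
at_start = re.match("foo", content)

# re.search scans forward and returns the first match at any position.
anywhere = re.search("foo", content)

# re.sub replaces every non-overlapping match, wherever it occurs.
replaced_all = re.sub("foo", "bar", content)

# With an explicit ^ anchor, re.sub only considers the start of the string.
replaced_anchored = re.sub("^foo", "bar", content)

print(at_start)            # None
print(anywhere.start())    # 3
print(replaced_all)        # xx bar yy bar
print(replaced_anchored)   # xx foo yy foo
```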


(BTW -- attempts to modify JSON this way are doomed to end in pain and suffering. Parse, update and reserialize -- otherwise you're opening yourself up to a wide range of bugs unnecessarily).

Upvotes: 3

Julien Palard

Reputation: 11536

I think you misunderstood greedy vs. non-greedy matching. Being non-greedy does not prevent your regex from matching from a { all the way to a far, far away "array":[]},.

A non-greedy match just stops at the closest "array":[]}, that follows the { where the match started.

You may replace the . in your .*? with [^}] to clearly prohibit the match from "getting out" of the {} pair.
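A small sketch of that change, on a shortened, made-up string in the same shape as the question's g (the user IDs and the "id" field are placeholders):

```python
import re

# Hypothetical, shortened input: one non-empty array, then one empty array.
g = 'aa {"userID": 777, "array":[{"id": 1}]}, bb {"userID": 774447, "array":[]}, cc'

# Original pattern: the lazy .*? still starts at the FIRST "{" and keeps
# expanding until it reaches an empty array, crossing object boundaries.
lazy = r'\{.*?"array":\[\]\},?'

# Suggested pattern: [^}] cannot cross a "}", so the match stays in one object.
bounded = r'\{[^}]*?"array":\[\]\},?'

lazy_result = re.sub(lazy, '', g)
bounded_result = re.sub(bounded, '', g)

print(repr(lazy_result))     # 'aa  cc'
print(repr(bounded_result))  # 'aa {"userID": 777, "array":[{"id": 1}]}, bb  cc'
```

This is still not a parser: an opening { that has no } before the next empty array (such as the outer {"all": [ in the question's data) would be swallowed along with it.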

But why not load this with json.loads, clean it, and write it back out with json.dumps? What if there are spaces or newlines (still valid JSON) around your : or }?
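A rough sketch of the parse, clean, and reserialize route. It assumes the file is one valid JSON document with the entries in a list under a top-level "all" key, which is only a guess from the sample in the question; the raw string below is made up for illustration:

```python
import json

# Hypothetical input; the "all" key and entry layout are assumptions.
raw = (
    '{"all": ['
    '{"userID": 777, "array": []}, '
    '{"userID": 774447, "array": [{"id": 503711372, "sport": 18}]}, '
    '{"userID": 722277, "array": []}'
    ']}'
)

data = json.loads(raw)

# Keep every entry except those whose "array" is an empty list.
data["all"] = [entry for entry in data["all"] if entry.get("array") != []]

cleaned = json.dumps(data)
print(cleaned)
```

Because the filtering happens on parsed objects, extra whitespace or newlines in the input no longer matter.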

Upvotes: 0
