Reputation: 2836

Replace an underscore separated substring in the middle of a comma separated string

I have a file with multiple lines in it like this:

 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}

I want to replace 1371078139195 (in this case) with another number. The value to replace is always in the first comma-separated field and is always the second-last underscore-separated value in that field. The following is how I did it. It works, but it seems clumsy.

>>> line="'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> l1=",".join(line.split(",")[1:])
>>> print l1
 {'cf:rv': '0'}
>>> l2=line.split(",")[0]
>>> print l2
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442'
>>> print "_".join(l2.split('_')[:-2])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight
>>>
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1])
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442'
>>> print "_".join(l2.split('_')[:-2])+ "_1234567_"+(l2.split('_')[-1]) + "," + l1
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}
>>>

Is there an easier way to replace the value (maybe using regular expressions)? I can't imagine that this is the best way.

I have received a few answers, and I have to stress that it's the second-last underscore-separated value. The following are all valid strings:

line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}"
line = "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"

In the cases above, a digit string also appears earlier in the first field, not only in the second-last position. Also, the last part may or may not be all digits (it could be +14155186442 or it could be 14155186442). Sorry I didn't mention this above.

Upvotes: 3

Answers (5)

6502

Reputation: 114491

Using regular expressions:

import re

# s is the input line; new_number() stands for whatever computes the replacement text
m = re.match(r"([^,]*_)([+]?[0-9]+)(_.*)", s)
if m:
    before = m.group(1)
    number = m.group(2)
    after = m.group(3)
    s = before + new_number(number) + after

The meaning is:

[^,]*_ = any number of characters other than commas, followed by an underscore
[+]?[0-9]+ = one or more digits, optionally preceded by +
_.* = an underscore followed by whatever remains

This works because regexp matching is "greedy" by default: [^,]* first consumes as many underscores as it can, then backtracks just far enough for the rest of the pattern to match, so the first group ends at the underscore right before the second-last value.
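
As a rough check of that behaviour, here is a minimal sketch that wraps the same pattern in a helper and runs it on lines modeled on the question's clarified samples (the title field is shortened). The names PATTERN, SAMPLES and replace_second_last are made up for this illustration, and the helper splices the replacement in by position so that anything after the match, such as a trailing newline, is kept.

```python
import re

# Same pattern as in the answer; compiled once for reuse.
PATTERN = re.compile(r"([^,]*_)([+]?[0-9]+)(_.*)")

# Hypothetical sample lines with the same field layout as the question's.
SAMPLES = [
    "'AMS_Investigation|txtt.co_23456_SomeTitle_1371078139195_+14155186442', {'cf:rv': '0'}",
    "'AMS_Investigation|txtt.co_23456_SomeTitle_1371078139195_14155186442', {'cf:rv': '0'}",
    "'AMS_Investigation|txtt.co_1371078139195_SomeTitle_1371078139195_1371078139195', {'cf:rv': '0'}",
]

def replace_second_last(line, new_value):
    """Replace the number captured by group 2, or return line unchanged."""
    m = PATTERN.match(line)
    if m is None:
        return line
    # Splice by position so text outside the match is preserved as-is.
    return line[:m.start(2)] + new_value + line[m.end(2):]

for sample in SAMPLES:
    print(replace_second_last(sample, "1234567"))
```

One caveat with this pattern: if the second-last value is not numeric, the match falls back to an earlier numeric value that is followed by an underscore, so it only targets the right field when that field is known to be digits.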

If, for example, you need the third-last underscore-separated value instead of the second-last, the expression could be changed to

m = re.match("([^,]*_)([+]?[0-9]+)(_[^,]*_.*)", s)

thus requiring that after the number there are at least two underscores before a comma.
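
A small sketch of the third-last variant, run on a made-up line (the sample string and the names THIRD_LAST and replace_third_last are assumptions for illustration only):

```python
import re

# Third-last variant from the answer: after the number there must be
# at least two more underscores before the first comma.
THIRD_LAST = re.compile(r"([^,]*_)([+]?[0-9]+)(_[^,]*_.*)")

def replace_third_last(line, new_value):
    """Replace the number captured by group 2, or return line unchanged."""
    m = THIRD_LAST.match(line)
    if m is None:
        return line
    return line[:m.start(2)] + new_value + line[m.end(2):]

sample = "'a_b_111_222_333', {'k': 'v'}"  # hypothetical input
print(replace_third_last(sample, "XXX"))  # 'a_b_XXX_222_333', {'k': 'v'}
```

Because the requirement is "at least two underscores", a line whose third-last value is not numeric would have an earlier numeric value replaced instead.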

Upvotes: 4

martineau

Reputation: 123473

Not as sophisticated as a regex, but relatively simple to code, understand, debug, and change in the future. Other than the separator characters, it makes no assumptions about what characters make up a "word".

def replace_term(line, replacement):
    csep = line.split(',')
    usep = csep[0].split('_')
    return ','.join(['_'.join(usep[:-2] + [replacement] + usep[-1:])] + csep[1:])

lines = ["'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}",
         "'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}"]

for line in lines:
    print replace_term(line, 'XXX')

Output:

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_XXX_14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_XXX_1371078139195', {'cf:rv': '0'}
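
Since the question mentions a file with many such lines, here is a hedged sketch of how the same split/join idea might be applied to every line of an iterable (an open file works as one). The names replace_term_checked and rewrite_lines are invented for this example, and the guard for lines without an underscore in the first field is an assumption about what should happen to such lines.

```python
def replace_term_checked(line, replacement):
    """Same idea as replace_term, but leave the line unchanged when the
    first comma-separated field has no second-last value to replace."""
    csep = line.split(',')
    usep = csep[0].split('_')
    if len(usep) < 2:
        # Assumption: lines without an underscore pass through untouched.
        return line
    usep[-2] = replacement
    return ','.join(['_'.join(usep)] + csep[1:])

def rewrite_lines(lines, replacement):
    """Apply the replacement to each line of an iterable of strings."""
    return [replace_term_checked(line, replacement) for line in lines]

demo = [
    "'A_B|c.co_23456_SomeTitle_1371078139195_+14155186442', {'cf:rv': '0'}\n",
    "no underscores here, x\n",
]
for out in rewrite_lines(demo, 'XXX'):
    print(out.rstrip('\n'))
```

Trailing newlines survive because they sit in the last comma-separated piece, which is joined back unchanged.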

Upvotes: 0

R. Max

Reputation: 6710

Like this?

>>> line = "'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> re.subn('_(\d+)_', '_mynewnumber_', line, count=1)
("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_mynewnumber_+14155186442', {'cf:rv': '0'}", 1)
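
Note that with count=1 this replaces the first _digits_ run in the line, so on a line where an earlier value such as 23456 is numeric, that earlier value is the one that changes. One possible variant, sketched under the assumption that the target is the digit run whose following value is the last one before the first comma, uses a lookahead. The names SECOND_LAST, SAMPLE and swap_second_last are made up here, and the sample line shortens the title field.

```python
import re

# Assumption: the wanted digits are followed by "_<last value>," where the
# last value contains no underscore or comma. The lookahead checks this
# without consuming the text.
SECOND_LAST = re.compile(r"_\d+(?=_[^_,]*,)")

SAMPLE = "'AMS_Investigation|txtt.co_23456_SomeTitle_1371078139195_+14155186442', {'cf:rv': '0'}"

def swap_second_last(line, new_value):
    # A function replacement keeps new_value literal (no escape processing).
    return SECOND_LAST.sub(lambda m: "_" + new_value, line, count=1)

# Unanchored pattern from the answer: changes 23456 on this line.
print(re.subn(r'_(\d+)_', '_mynewnumber_', SAMPLE, count=1)[0])
# Lookahead variant: changes 1371078139195.
print(swap_second_last(SAMPLE, "mynewnumber"))
```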

Upvotes: 1

Ashwini Chaudhary

Reputation: 250961

Non-regex solution:

>>> strs = " 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"
>>> first, sep, rest = strs.partition(',')
>>> lis = first.rsplit('_', 2)
>>> lis[1] = "1111111"
>>> "_".join(lis) + sep + rest
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111111_+14155186442', {'cf:rv': '0'}"

Function:

>>> def solve(strs, rep):
...     first, sep, rest = strs.partition(',')
...     lis = first.rsplit('_', 2)
...     lis[1] = rep
...     return "_".join(lis) + sep + rest
...
>>> solve(" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}", "1111")
" 'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1111_+14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_14155186442', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_2222_14155186442', {'cf:rv': '0'}"
>>> solve("'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_1371078139195_1371078139195', {'cf:rv': '0'}", "2222")
"'AMS_Investigation|txtt.co_1371078139195_BigtittedBlondOtherNight_2222_1371078139195', {'cf:rv': '0'}"

Upvotes: 3

eyquem

Reputation: 27575

import re

r = re.compile(r'([^,]*_)(\d+)(?=_[^_,]+,)(_.*)')

for line in ("'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}",
             "'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}"):
    print line
    print r.sub(r'\1ABCDEFG\3',line)
    print r.sub(r'\g<1>1234567\3',line)

result

'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1371078139195_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_ABCDEFG_+14155186442', {'cf:rv': '0'}
'AMS_Investigation|txtt.co_23456_BigtittedBlondOtherNight_1234567_+14155186442', {'cf:rv': '0'}

\g<1> means 'group 1'. See in the doc:

In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
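
To make the ambiguity concrete, here is a small sketch using the same pattern on a made-up short line. The sample string and the helper name replace_number are assumptions for illustration, and the helper assumes new_value contains no backslashes.

```python
import re

pattern = re.compile(r"([^,]*_)(\d+)(?=_[^_,]+,)(_.*)")
line = "'abc_111_222', {'k': 'v'}"  # hypothetical short sample

def replace_number(text, new_value):
    # \g<1> and \g<3> stay unambiguous even when new_value starts with a digit.
    return pattern.sub(r"\g<1>" + new_value + r"\g<3>", text)

print(replace_number(line, "0"))
print(replace_number(line, "1234567"))

# Gluing a digit directly onto \1 changes the reference: r"\1" + "0" is
# r"\10", i.e. group 10, which this three-group pattern does not define.
try:
    pattern.sub(r"\1" + "0" + r"\3", line)
except re.error as exc:
    print("re.error: %s" % exc)
```

Passing a function as the replacement (for example a lambda that concatenates m.group(1), the new value and m.group(3)) is another way to sidestep template parsing altogether.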

Upvotes: 0
