user745235

Using sed to replace part of a text based on a regular expression result

I need to read a log file and look for the text <KEY>any_number_here</KEY> and <KEYVAL>any_number_hereDany_number_here</KEYVAL> and replace those numbers so it looks like this:

<KEY>*************5683</KEY> and <KEYVAL>*************5683D00000000000000000000</KEYVAL>

This is an example of the log line:

2016/02/01 04:20:21 [18f][00000000000001526][0][00000000000000] Some text here: [size: 000 communication_format: ISO0000 data: "<Document xmlns='bla'><KEY>44444444444445683</KEY><DATE>2017-05</DATE><DATA>2</DATA><KEYVAL>44444444444445683D00000000000000000000</KEYVAL>"]

Notice the D separating values on <KEYVAL>.

This is my first time using sed. I can match the value inside the <KEY> tag, but I don't know how to work on that value and replace part of it with *.

I have only the expression to get what's inside the <KEY> tag:

sed -e 's/<KEY>\([[:digit:]]*\)<\/KEY>/ANOTHER SUBSTITUTION HERE?/' test.log

UPDATE Now I have this solution, which is the closest I got to what I need:

sed -e 's/<KEY>[[:digit:]]\{13\}/(&)/g' -e 's/(.*)/<KEY>*************/g' pan.txt

The problem with that is that it is replacing any () it finds with <KEY>************* and there are several () in the log file.

UPDATE 2

I think I found the solution:

sed -e 's/<KEY>[[:digit:]]\{13\}/(&)/g' -e 's/(.*)/<KEY>*************/g' pan.txt

This is working only for the KEY tag.

Upvotes: 0

Views: 62

Answers (2)

Benjamin W.

Reputation: 52536

As a one-liner:

$ sed -r ':a;s|(<KEY>\**)[0-9]([0-9]*[0-9]{4}</KEY>)|\1*\2|;s|(<KEYVAL>\**)[0-9]([0-9]*[0-9]{4}D[^<]*</KEYVAL>)|\1*\2|;ta' <<< "$var"
2016/02/01 04:20:21 [18f][00000000000001526][0][00000000000000] Some text here: [size: 000 communication_format: ISO0000 data: "<Document xmlns=bla><KEY>*************5683</KEY><DATE>2017-05</DATE><DATA>2</DATA><KEYVAL>*************5683D00000000000000000000</KEYVAL>"]

This handles any number of digits and always just leaves the last four. To allow for this flexibility, the overall structure of the command is as follows:

:label   # Label to branch to
s///     # Substitute one digit for <KEY>
s///     # Substitute one digit for <KEYVAL>
t label  # If a substitution took place, branch back to 'label'

So as long as any of the substitutions did something, the t command (conditional branching) loops back and we try to replace another digit.
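To see the label-and-branch loop in isolation, here is a minimal sketch that works on a bare number instead of the tagged log line. The function name mask_digits is made up for illustration, and -E is assumed to be accepted in place of -r (GNU and BSD sed both take it).

```shell
# Hypothetical helper: mask all but the last four digits of a digits-only line.
# ':a' sets the label, the s command replaces one leading digit per pass
# (only while at least four digits remain after it), and 'ta' loops back
# whenever that substitution succeeded.
mask_digits() {
  sed -E -e ':a' \
         -e 's/^(\**)[0-9]([0-9]{4,})$/\1*\2/' \
         -e 'ta'
}

printf '%s\n' 44444444444445683 | mask_digits
```

Each pass turns exactly one digit into an asterisk, so the loop runs once per masked digit and then falls through when the substitution no longer matches.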

Now, for the substitutions, they look as follows:

s|(<KEY>\**)[0-9]([0-9]*[0-9]{4}</KEY>)|\1*\2|

This uses two capture groups: one that contains <KEY> and however many * are after it. Then comes a single, uncaptured digit (which we'll replace in this loop), and then the second capture group consisting of [0-9]*[0-9]{4}</KEY>, i.e., any number of digits ending in four digits and </KEY>. The substitution simply replaces the uncaptured digit with an asterisk.

Notice that I use extended regular expressions (-r option) so I don't have to escape (), and the pipe | as delimiter so I don't have to escape /.
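For comparison, here is a sketch of the same substitution written both ways. Each command performs a single pass of the loop body on a shortened, made-up sample; -E is assumed as the portable spelling of -r.

```shell
sample='<KEY>123456</KEY>'

# Basic regular expressions with / as delimiter:
# groups, interval braces and the / in the closing tag all need backslashes.
bre=$(printf '%s\n' "$sample" |
  sed 's/\(<KEY>\**\)[0-9]\([0-9]*[0-9]\{4\}<\/KEY>\)/\1*\2/')

# Extended regular expressions with | as delimiter:
# none of those backslashes are needed.
ere=$(printf '%s\n' "$sample" |
  sed -E 's|(<KEY>\**)[0-9]([0-9]*[0-9]{4}</KEY>)|\1*\2|')

printf '%s\n%s\n' "$bre" "$ere"
```

Both variants replace only the first digit, since this is one pass without the surrounding loop.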

The second substitution is almost the same:

s|(<KEYVAL>\**)[0-9]([0-9]*[0-9]{4}D[^<]*</KEYVAL>)|\1*\2|

The only difference is that it looks for KEYVAL instead of KEY, and between the closing tag and the four digits to be kept there is D[^<]*, i.e., a D followed by any number of characters other than the opening angle bracket.

Alternative solution without looping

Definitely not one-liner material, but potentially faster for huge log files:

h        # Copy pattern space to hold space

# Remove everything except digits we want to replace from pattern space
s|.*<KEY>(.*)[0-9]{4}</KEY>.*|\1|

s/./*/g  # Replace digits with '*'
G        # Append hold space to pattern space

# Rearrange pattern space
s|(.*)\n(.*<KEY>).*([0-9]{4}</KEY>.*)$|\2\1\3|

# And the same for the KEYVAL part
h
s|.*<KEYVAL>(.*)[0-9]{4}D.*</KEYVAL>.*|\1|
s/./*/g
G
s|(.*)\n(.*<KEYVAL>).*([0-9]{4}D.*</KEYVAL>.*)$|\2\1\3|

This has to be stored in a separate file (some seds don't like the comments, so they can be removed) and then called like this:

$ sed -rf sedscr.sed <<< "$var"
2016/02/01 04:20:21 [18f][00000000000001526][0][00000000000000] Some text here: [size: 000 communication_format: ISO0000 data: "<Document xmlns=bla><KEY>*************5683</KEY><DATE>2017-05</DATE><DATA>2</DATA><KEYVAL>*************5683D00000000000000000000</KEYVAL>"]
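One caveat with the script as written: on a line that contains no <KEY> tag, the first substitution does not match, so s/./*/g replaces every character of the line with an asterisk. A possible guard, sketched here for the KEY part only, is to wrap the commands in an address block so they run only on lines where the tag holds at least four digits. The function name mask_key and the sample lines are made up for illustration, and -E is assumed to be equivalent to -r.

```shell
# Hypothetical wrapper: run the hold-space steps only where <KEY> holds 4+ digits.
mask_key() {
  sed -E '
/<KEY>[0-9]{4,}<\/KEY>/{
  # Save the full line in the hold space
  h
  # Keep only the digits that will be masked
  s|.*<KEY>(.*)[0-9]{4}</KEY>.*|\1|
  # Turn them into asterisks
  s/./*/g
  # Append the saved line after a newline
  G
  # Splice the asterisks back in front of the last four digits
  s|(.*)\n(.*<KEY>).*([0-9]{4}</KEY>.*)$|\2\1\3|
}'
}

printf '%s\n' 'no tags here' 'x<KEY>1234565683</KEY>y' | mask_key
```

Lines that do not match the address are printed unchanged. The KEYVAL part could be guarded the same way with its own address block.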

Upvotes: 2

riteshtch

Reputation: 8769

$ cat inputfile
2016/02/01 04:20:21 [18f][00000000000001526][0][00000000000000] Some text here: [size: 000 communication_format: ISO0000 data: "<Document xmlns='bla'><KEY>44444444444445683</KEY><DATE>2017-05</DATE><DATA>2</DATA><KEYVAL>44444444444445683D00000000000000000000</KEYVAL>"]

$ egrep -o -e '<KEY>[0-9]+</KEY>' -e '<KEYVAL>[0-9]+D[0-9]+</KEYVAL>' inputfile | sed -r -e 's/^(<KEY>.*)([0-9]{4})(<\/KEY>)$/\1\n\2\3/g;' -e 's/^(<KEYVAL>.*)([0-9]{4}D[0-9]+)(<\/KEYVAL>)$/\1\n\2\3/g' | sed -e '1~2 s/[0-9]/*/g' | sed -n 'N;s/\n//g;p'
<KEY>*************5683</KEY>
<KEYVAL>*************5683D00000000000000000000</KEYVAL>

This handles any number of digits before 5683 in KEY, and any number of digits before and after 5683D in KEYVAL. The 5683 itself can be any four digits.
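Note that this pipeline prints only the masked tags, not the rest of the log line. To make the four stages easier to follow, here is a sketch that runs them one at a time on a shortened, made-up sample. It assumes GNU grep and GNU sed, since \n in the replacement and the 1~2 address are GNU extensions, and it spells egrep as grep -E.

```shell
sample='<KEY>1234565683</KEY><DATA>2</DATA><KEYVAL>1234565683D000</KEYVAL>'

# Stage 1: print each matching tag on its own line.
tags=$(printf '%s\n' "$sample" |
  grep -E -o -e '<KEY>[0-9]+</KEY>' -e '<KEYVAL>[0-9]+D[0-9]+</KEYVAL>')

# Stage 2: break each tag in two, just before the four digits to keep.
split=$(printf '%s\n' "$tags" |
  sed -r -e 's/^(<KEY>.*)([0-9]{4})(<\/KEY>)$/\1\n\2\3/' \
         -e 's/^(<KEYVAL>.*)([0-9]{4}D[0-9]+)(<\/KEYVAL>)$/\1\n\2\3/')

# Stage 3: mask digits on odd-numbered lines only (the first half of each tag).
masked=$(printf '%s\n' "$split" | sed -e '1~2 s/[0-9]/*/g')

# Stage 4: join each pair of lines back into one tag.
printf '%s\n' "$masked" | sed -n 'N;s/\n//;p'
```

Printing $tags, $split and $masked in turn shows what each stage contributes.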

Upvotes: 1
