user2607367

Reputation: 225

Extracting the contents between two different strings using bash or perl

I have searched the other Stack Overflow posts on this but couldn't get my code to work, so I am posting a new question.

Below is the content of file temp.

 <?xml version="1.0" encoding="UTF-8"?>
 <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/"><env:Body><dp:response xmlns:dp="http://www.datapower.com/schemas/management"><dp:timestamp>2015-01-
 22T13:38:04Z</dp:timestamp><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body></env:Envelope>

This file contains the base64-encoded contents of two files named test.txt and test1.txt. I want to extract the base64-encoded content of each one into separate files, test.txt and test1.txt respectively.

To do this, I have to remove the XML tags around the base64 content. I am trying the commands below, but they are not working as expected.

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g' > test.txt

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g' > test1.txt

The command below:

sed -n '/test.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test.txt">@@g'|perl -p -e 's@</dp:file>@@g'

produces output:

 XJzLXJlc3VsdHMtYWN0aW9uX18i

<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:response>   </env:Body></env:Envelope>

However, I expect the output to contain only the first line, XJzLXJlc3VsdHMtYWN0aW9uX18i. Where am I making a mistake?

When I run the command below, I get the expected output:

sed -n '/test1.txt"\>/,/\<\/dp:file\>/p' temp | perl -p -e 's@<dp:file name="temporary://test1.txt">@@g'|perl -p -e 's@</dp:file></dp:response></env:Body></env:Envelope>@@g'

It produces the string below:

lc3VsdHMtYWN0aW9uX18i

I can then easily redirect this to the test1.txt file.

UPDATE

I have edited the question to update the source file content. The source file doesn't contain any newline characters. The current solution does not work in that case; I tried it and it failed. wc -l temp must output 1.

OS: Solaris 10, Shell: bash

Upvotes: 1

Views: 271

Answers (2)

user2607367

Reputation: 225

/usr/xpg4/bin/sed works well here.

/usr/bin/sed does not work as expected when the file contains just one line.

The command below works for a file containing only a single line:

/usr/xpg4/bin/sed -n 's_<env:Envelope\(.*\)<dp:file name="temporary://BackUpDir/backupmanifest.xml">\([^>]*\)</dp:file>\(.*\)_\2_p' securebackup.xml 2>/dev/null

Without 2>/dev/null, this sed command prints the warning sed: Missing newline at end of file.

This is for the following reason:

The default Solaris sed ignores a last line that is not terminated by a newline, so as not to break existing scripts, because the original Unix implementation required every line to end with a newline.

GNU sed is more relaxed about this, and the POSIX implementation (/usr/xpg4/bin/sed) accepts the line but prints a warning.
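
If the warning (or the skipped last line) is a concern, one option is to make sure the file ends with a newline before handing it to sed. The following is only a minimal sketch: the file name and its content are hypothetical, and it assumes a tail that accepts the POSIX -c option (on Solaris 10 that may mean /usr/xpg4/bin/tail).

```shell
#!/bin/sh
# Hypothetical demo file: a single line of text with no terminating newline.
demo_file="${TMPDIR:-/tmp}/newline_demo.$$"
printf '%s' '<a>payload</a>' > "$demo_file"

# Command substitution strips trailing newlines, so the result is empty
# exactly when the last byte is a newline (or the file is empty).
if [ -n "$(tail -c 1 "$demo_file")" ]; then
    # Last byte is not a newline: append one so every sed sees a complete line.
    printf '\n' >> "$demo_file"
fi

# wc -l counts newline characters; strip the padding some wc versions add.
line_count=$(wc -l < "$demo_file" | tr -d ' ')
echo "$line_count"

rm -f "$demo_file"
```

After the fix the file has exactly one newline-terminated line, so the script prints 1.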

Upvotes: 0

NeronLeVelu

Reputation: 10039

sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p' temp
  • I added \1 -> to show the link from file name to content; for the content only, just remove this part
  • this is the POSIX version, so with GNU sed use --posix
  • this assumes the base64-encoded content is on the same line as the surrounding tag (not spread over several lines, which would need some modification)

Thanks to JID for the full explanation below
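
When the whole response sits on one line, as in the updated question, the substitution above only replaces from the first <dp:file ...> tag to the end of the line, so the text before that tag stays in the output and the second file is never reported. One possible workaround, sketched here under assumptions, is to split the input at every < first so that each tag starts its own line. The scratch directory name and the abbreviated sample are made up for the demo, and the loop assumes that names and content contain no whitespace, which holds for the base64 strings shown.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not overwrite real files.
workdir="${TMPDIR:-/tmp}/dp_extract_demo.$$"
mkdir -p "$workdir" && cd "$workdir" || exit 1

# Abbreviated sample response: a single line with no trailing newline.
printf '%s' '<env:Body><dp:response><dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file><dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file></dp:response></env:Body>' > temp

# 1. tr turns every '<' into a newline (octal 012), so each tag starts a line.
# 2. sed keeps only the dp:file lines and prints "name content".
# 3. The loop writes each content string to the file named in the tag.
#    Note: the output file name comes from the input, so only use this
#    on responses you trust.
tr '<' '\012' < temp |
sed -n 's_^dp:file name="temporary://\([^"]*\)">\(.*\)$_\1 \2_p' |
while read -r name content; do
    printf '%s\n' "$content" > "$name"
done
```

After it runs, test.txt and test1.txt in the scratch directory each hold one base64 string. Because the base64 content cannot contain <, splitting on that character does not cut through the payload.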


How it works

sed -n

The -n option suppresses automatic printing, so sed produces no output unless it is explicitly told to print.

's_

This starts a substitution on the regex that follows, using _ as the delimiter between the regex and the replacement.

<dp:file name=

Literal text, matched as-is

"\([^"]*\)"

The escaped parentheses form a capture group; they must be escaped unless the -r option is used (-r is not available in POSIX sed). Everything inside the parentheses is captured. [^"]* means zero or more occurrences of any character that is not a double quote, so this simply captures everything between the two quotes.

>\([^<]*\)

Again a capture group, this time capturing everything after the > up to, but not including, the next <

.*

Everything else on the line

_\1 -> \2

This is the replacement: everything matched by the regex is replaced with the first capture group, then ->, then the second capture group.

_p

Prints the line if the substitution was made
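
To see the pieces above working together, here is a small sketch that runs the same command on a hypothetical two-line sample with one dp:file element per line (the case this answer assumes). The variable names are made up for the demo.

```shell
#!/bin/sh
# Hypothetical sample: one dp:file element per line.
sample='<dp:file name="temporary://test.txt">XJzLXJlc3VsdHMtYWN0aW9uX18i</dp:file>
<dp:file name="temporary://test1.txt">lc3VsdHMtYWN0aW9uX18i</dp:file>'

# \1 is the name attribute, \2 is the text up to the next '<';
# the trailing .* swallows the closing tag so it is not printed.
result=$(printf '%s\n' "$sample" |
    sed -n 's_<dp:file name="\([^"]*\)">\([^<]*\).*_\1 -> \2_p')

printf '%s\n' "$result"
# Prints:
# temporary://test.txt -> XJzLXJlc3VsdHMtYWN0aW9uX18i
# temporary://test1.txt -> lc3VsdHMtYWN0aW9uX18i
```

Replacing \1 -> \2 with just \2 in the replacement prints only the base64 content.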


Resources

http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

http://www.grymoire.com/Unix/Sed.html

Upvotes: 2
