Reputation: 573

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Every line in the file has one of these two formats:

<field number='1' name='Account' type='STRING'W/>

<field number='2' name='AdvId' type='STRING'W>

I've inserted a 'W' at the end, which represents optional whitespace. The order and number of attributes are not necessarily the same on every line of the file, although "number" always comes before "type".

What I'm looking for is a regular expression "regexp" that I can give to GNU sed so that this command:

gsed regexp -i FILENAME

gives me a file with lines looking like this:

1 STRING

2 STRING

I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.

I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?

Thanks a lot, jules

Upvotes: 0

Answers (7)

Patrick B.

Reputation: 12393

I'm sure this can be optimized, but it works for me and answers your question:

 sed "s/^.*number='\([0-9]*\)'.*type='\([^']*\)'.*$/\1 \2/" FILENAME

That said, I think the others are right: if you have an XML file, you should use an XML parser.
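
Since the question asks for an in-place edit, here is a minimal sketch of how an expression like this could be tried on a throwaway file. It assumes GNU sed (for -i without a suffix); the temporary file and the third sample line, with an attribute after type, are made up for illustration. Each capture is bounded by [^']* so it stops at the closing quote.

```shell
# Build a throwaway file with the two line formats from the question,
# plus a made-up line where another attribute follows type.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<field number='1' name='Account' type='STRING' />
<field number='2' name='AdvId' type='STRING'>
<field number='3' type='INT' name='Qty'  >
EOF

# [^']* stops at the closing quote, so attributes after type are not swallowed.
sed -i "s/^.*number='\([0-9]*\)'.*type='\([^']*\)'.*$/\1 \2/" "$tmp"

# Keep the result, then clean up the temporary file.
result=$(cat "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

With a greedy .* inside the second group, the third line would capture everything up to the last quote on the line instead of just INT.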

Upvotes: 1

konsolebox

Reputation: 75588

sed -ni "/<field .*>/s@^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*@\1 \2@p" FILENAME

Or, if the values of number and type are allowed to be empty:

sed -ni "/<field .*>/s@^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*@\1 \2@p" FILENAME

Just change [^']\\+ to [^']* (or back) as you prefer.
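
To see the difference between the two variants, a small sketch like the following may help. It assumes GNU sed (\+ is a GNU extension in basic regular expressions) and reads from stdin rather than editing a file; the sample lines and the function names strict and lenient are made up for illustration.

```shell
# Hypothetical input: a normal line, a line with an empty type, and a non-field line.
sample="<field number='1' name='Account' type='STRING' />
<field number='2' name='AdvId' type='' >
<group name='Header'>"

# Strict variant: \+ requires at least one character inside the quotes.
strict() {
  sed -n "/<field .*>/s@^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*@\1 \2@p"
}

# Lenient variant: * also accepts empty attribute values.
lenient() {
  sed -n "/<field .*>/s@^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*@\1 \2@p"
}

printf '%s\n' "$sample" | strict
printf '%s\n' "$sample" | lenient
```

Because of -n and the p flag, only lines where the substitution succeeded are printed. The strict variant drops the line with the empty type, the lenient one prints it as a 2 followed by a space, and the non-field line is dropped by both.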

Upvotes: 0

Song Gao

Reputation: 666

You would be better off using an XML parser, but if you had to use sed:

sed -r "s/<field number='([^']*)'.*type='([^']*)'.*/\1 \2/"

Upvotes: 0

Casimir et Hippolyte

Reputation: 89629

You can use this:

sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"

Upvotes: 0

dtorgo

Reputation: 2116

Simple cut should work for you:

cut -f2,6 -d"'" --output-delimiter=" "

If you really want sed:

sed -r "s/.*number='([^']*)'.*type='([^']*)'.*/\1 \2/"
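
One thing to keep in mind is that cut selects by position, while the question says the attribute order can vary. A rough sketch of the difference, assuming GNU cut and GNU sed; the two sample lines and the function names by_position and by_name are made up for illustration:

```shell
# Two hypothetical lines: the second puts type before name.
line_a="<field number='1' name='Account' type='STRING' />"
line_b="<field number='2' type='STRING' name='AdvId' >"

# Positional: picks the 2nd and 6th quote-delimited fields (GNU cut).
by_position() { cut -f2,6 -d"'" --output-delimiter=" "; }

# By name: anchors on the attribute names instead of their positions.
by_name() { sed -r "s/.*number='([^']*)'.*type='([^']*)'.*/\1 \2/"; }

printf '%s\n' "$line_a" | by_position
printf '%s\n' "$line_b" | by_position
printf '%s\n' "$line_b" | by_name
```

For the first line cut gives the expected pair, but for the second it returns the value of name in place of type, while the sed variant still finds type by name.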

Upvotes: 0

choroba

Reputation: 242208

Using xsh, a Perl wrapper around XML::LibXML:

open file.xml ;
for //field echo @number @type ;

Upvotes: 2

Brian Agnew

Reputation: 272407

I think you're much better off using a command-line XML tool such as XMLStarlet. It integrates well with the shell and lets you perform XPath searches. It's XML-aware, so it handles character encodings, whitespace, etc. correctly.
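
As a rough sketch of what that could look like, assuming XMLStarlet is installed under the name xmlstarlet (some systems install it as xml) and that it is run on the original, well-formed XML file rather than on the extracted lines; the function name xml_fields is made up for illustration:

```shell
# Print "number type" for every <field> element in the given XML file.
# -m iterates over the XPath matches, -v prints a value,
# -o prints a literal string, -n prints a newline.
xml_fields() {
  xmlstarlet sel -t -m '//field' -v '@number' -o ' ' -v '@type' -n "$1"
}
```

It would be called as xml_fields FILENAME, where FILENAME is the original XML file.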

Upvotes: 1
