colding

Reputation: 573

Regular expression to extract text from XML-ish data using GNU sed

I have a file full of lines extracted from an XML file using "gsed regexp -i FILENAME". Each line in the file has one of two formats:

<field number='1' name='Account' type='STRING'W/>

<field number='2' name='AdvId' type='STRING'W>

I've inserted a 'W' at the end, which represents optional whitespace. The order and number of attributes are not necessarily the same on every line, although "number" always comes before "type".

What I'm looking for is a regular expression "regexp" that I can give to GNU sed so that this command:

gsed regexp -i FILENAME

gives me a file with lines looking like this:

1 STRING

2 STRING

I don't care about the amount of whitespace in the result as long as there is some after the number and a newline at the end of each line.

I'm sure it is possible, but I just can't figure out how in a reasonable amount of time. Can anyone help?

Thanks a lot, jules

Upvotes: 0

Views: 275

Answers (7)

Patrick B.

Reputation: 12333

I'm sure this can be optimized, but it works for me and answers your question:

 sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/" <filename>

That said, I think the others are right: if you have an XML file, you should use an XML parser.
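One caveat with the sed command above: the `.*` inside the second group is greedy, so if another quoted attribute follows type, the capture runs up to the last quote on the line. Here is a rough sketch of a tighter variant that uses `[^']*` instead. The `required` attribute and the function names `greedy` and `tight` are made up for illustration and are not from the question.

```shell
# Hypothetical line where another attribute follows type.
line="<field number='3' type='INT' required='Y'/>"

# Pattern from the answer: the second group uses a greedy .*
greedy() {
  sed "s/^.*number='\([0-9]*\)'.*type='\(.*\)'.*$/\1 \2/"
}

# Tighter variant: the second group stops at the next single quote.
tight() {
  sed "s/^.*number='\([0-9]*\)'.*type='\([^']*\)'.*$/\1 \2/"
}

printf '%s\n' "$line" | greedy
printf '%s\n' "$line" | tight
```

With this input, the greedy version should keep the trailing attribute text in its output, while the tight version should print just the number and the type. For lines where type is the last attribute, as in the question's samples, both behave the same.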

Upvotes: 1

konsolebox

Reputation: 75458

sed -ni "/<field .*>/s@^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*@\1 \2@p" FILENAME

Or, if you don't mind the values of number and type being empty:

sed -ni "/<field .*>/s@^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*@\1 \2@p" FILENAME

Just switch between [^']\\+ and [^']* as you prefer.
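To see the difference between the two variants without touching a file, here is a small sketch that feeds sample lines through both patterns on stdin (so no -i). The third sample line, with an empty type, and the function names `sample`, `strict` and `lenient` are assumptions added for illustration.

```shell
# The two line shapes from the question, plus a hypothetical line
# whose type value is empty.
sample() {
  printf '%s\n' \
    "<field number='1' name='Account' type='STRING' />" \
    "<field number='2' name='AdvId' type='STRING'>" \
    "<field number='3' name='Empty' type=''>"
}

# Strict: number and type must each hold at least one character.
# Note that \+ is a GNU sed extension.
strict() {
  sed -n "/<field .*>/s@^.*[[:space:]]number='\\([^']\\+\\).*[[:space:]]type='\\([^']\\+\\).*@\1 \2@p"
}

# Lenient: empty values are accepted as well.
lenient() {
  sed -n "/<field .*>/s@^.*[[:space:]]number='\\([^']*\\).*[[:space:]]type='\\([^']*\\).*@\1 \2@p"
}

sample | strict
sample | lenient
```

Because of -n and the p flag, only lines where the substitution succeeds are printed, so the strict variant should drop the line with the empty type while the lenient one keeps it.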

Upvotes: 0

Song Gao

Reputation: 666

You would be better off using an XML parser, but if you had to use sed:

sed -E "s/.*<field number='([^']*)'.*type='([^']*)'.*/\1 \2/" FILENAME

Upvotes: 0

Casimir et Hippolyte

Reputation: 89547

You can use this:

sed -r "s/<field [^>]*number='([0-9]+)'[^>]*type='([^']+)'[^>]*>/\1 \2/"

Upvotes: 0

dtorgo

Reputation: 2116

Simple cut should work for you:

cut -f2,6 -d"'" --output-delimiter=" "

If you really want sed:

sed -r "s/[^']*'([^']*)'.*type='([^']*)'.*/\1 \2/"
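A quick sketch of what the cut command above relies on: it splits each line on single quotes and keeps fields 2 and 6, so it depends on number being the first attribute and type the third. The second input line below, with a different attribute order, and the function name `pick` are hypothetical and only there to show that limitation.

```shell
# Split on single quotes and keep fields 2 and 6.
# --output-delimiter is a GNU cut option.
pick() {
  cut -f2,6 -d"'" --output-delimiter=" "
}

# Attribute layout from the question: number, name, type.
printf '%s\n' "<field number='1' name='Account' type='STRING'/>" | pick

# Hypothetical line with name first: fields 2 and 6 now hold
# the name and the type instead of the number and the type.
printf '%s\n' "<field name='Account' number='1' type='STRING'/>" | pick
```

Since the question says the order and number of attributes can vary, the cut approach is only safe if every line really has this exact layout.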

Upvotes: 0

choroba

Reputation: 241768

Using xsh, a Perl wrapper around XML::LibXML:

open file.xml ;
for //field echo @number @type ;

Upvotes: 2

Brian Agnew

Reputation: 272237

I think you're much better off using a command-line XML tool such as XMLStarlet. It integrates well with the shell and lets you perform XPath searches. It's XML-aware, so it will handle character encodings, whitespace, etc. correctly.
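As a rough sketch of what that could look like, assuming the original (well-formed) XML file is still available and the binary is installed under the name `xmlstarlet` (some distributions ship it as `xml`). The sample document, its `fields` root element and the function name `extract_fields` are made up for illustration.

```shell
# Hypothetical sample document; the root element name is an assumption.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<fields>
  <field number='1' name='Account' type='STRING' />
  <field number='2' name='AdvId' type='STRING'></field>
</fields>
EOF

# For every <field> element, print its number and type attributes
# separated by a space, one element per line.
extract_fields() {
  xmlstarlet sel -t -m '//field' -v '@number' -o ' ' -v '@type' -n "$1"
}

if command -v xmlstarlet >/dev/null 2>&1; then
  extract_fields "$xml_file"
else
  echo "xmlstarlet not found; skipping the example" >&2
fi
rm -f "$xml_file"
```

The XPath expression selects the attributes by name, so attribute order and extra attributes do not matter here.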

Upvotes: 1
