GuleLim

Reputation: 61

Shell script to extract certain fields from XML files

I am new to the Linux shell and I don't understand regexes.

Here is my question: I have a directory called /var/visitors and under this directory, I have directories like a, b, c, d. In each of these directories, there is a file called list.xml and here, for example, is the content of list.xml from /var/visitors/a:

<key>Name</key>
<string>Mr Jones</string>
<key>ID</key>
<string>51</string>
<key>Len</key>
<string>53151334</string>

What I want to do is pair the Name key with its corresponding string and the ID key with its corresponding string. I don't need any other fields. The output should look like this:

Name: Mr Jones
ID: 51
---
Name: Ms Maggie
ID: 502

Here is how far I got:

cd /var/visitors
find -name "list.xml" | xargs grep ?????

Please help.

Upvotes: 1

Views: 3692

Answers (5)

John Hascall

Reputation: 9416

I didn't include the separator line because I wasn't sure whether you wanted it or whether it was just an artifact of using grep; it's easy enough to add. The command:

find -name "list.xml" | xargs awk -F '[<>]' -f xml.awk

And the contents of xml.awk:

$2 != "string" { K=$3 }
$2 == "string" { if ((K == "Name") || (K == "ID")) print K ": " $3 }
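
If the separator is wanted, one way to add it is sketched below. This is only a sketch: it assumes every list.xml keeps one tag per line, as in the question, and the function name extract_visitors is made up for illustration. It uses find's -exec ... {} + rather than xargs so that paths containing spaces are passed through intact.

```shell
# Hypothetical helper: print Name/ID pairs from every list.xml under a
# directory, with a "---" line between consecutive files.
extract_visitors() {
    # $1 is the directory to search (defaults to the current directory).
    find "${1:-.}" -name 'list.xml' -exec awk -F '[<>]' '
        FNR == 1 && NR > 1 { print "---" }   # a new file begins: emit the separator
        $2 == "key"        { k = $3 }        # remember the most recent key
        $2 == "string" && (k == "Name" || k == "ID") { print k ": " $3 }
    ' {} +
}
```

Calling extract_visitors /var/visitors should print the pairs for each directory. One caveat: if there are so many files that find has to run awk more than once, the separator between two batches will be missing.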

Upvotes: 0

Greg Bender

Reputation: 31

Not elegant, but this will work:

find -name "list.xml" | xargs cat | tr -d "\n" | sed 's/<\/string>/\n/g' | sed 's/<\/key>/: /g' | sed 's/<[^>]*>//g' | grep -E "Name:|ID:" | sed 's/Name: /---\nName: /g'

Basically it does this:

  • remove all newlines
  • put each key/value pair on its own line
  • add the : separator
  • remove all remaining tags (everything between < and >)
  • only save Name and ID fields (drop all others)
  • add --- separator

Sample Output:

---
Name: Greg
ID: 52
---
Name: Amy
ID: 53
---
Name: Mr Jones
ID: 51

Upvotes: 2

Agent Smith

Reputation:

Assuming you have the file foo.bar containing the following text:

<key>Name</key>
<string>Mr Jones</string>
<key>ID</key>
<string>51</string>
<key>Len</key>
<string>53151334</string>

something like this will work:

$ awk -F '[<>]' '{if (FNR%2==1) {printf "%s: ",$3} else {print $3}}' foo.bar
Name: Mr Jones
ID: 51
Len: 53151334

If it's not entirely what you're wanting, shoe-horn it further to meet your specific requirements.

Upvotes: 0

Lazy Bob

Reputation: 449

This is real dirty, but if you're sure they're in the format they're in, you could throw some perl together to parse it... something like

for (<STDIN>) {
  if (/<key>([^<]*)</) { print $1 . " : "; }
  if (/<string>([^<]*)</) { print $1 . "\n"; }
}

That may not be perfect, but it should come close to what you're looking for. There is probably a Perl module that will parse XML for you, too, but for such a simple schema I think you'll be OK without it.

Upvotes: 0

hhafez

Reputation: 39800

Grep alone is not going to help you here; you are going to need something like sed or awk.
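
As a rough illustration of the sed route, here is a minimal sketch. It assumes each <key> line is immediately followed by its <string> line, as in the sample, and a sed that accepts -E (GNU and BSD sed both do); the function name name_and_id is hypothetical.

```shell
# Hypothetical helper: print "Name: ..." and "ID: ..." lines from one file.
name_and_id() {
    # -n suppresses default output; only successful substitutions are printed.
    sed -n -E '/<key>(Name|ID)<\/key>/{
        N
        s/.*<key>([^<]*)<\/key>\n[^<]*<string>([^<]*)<\/string>.*/\1: \2/p
    }' "$1"
}
```

For each matching key line, N pulls the following line into the pattern space so that the key and its value can be rewritten together. It could be run on every file with something like: find /var/visitors -name list.xml | while read -r f; do name_and_id "$f"; echo ---; done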

Upvotes: 0
