Reputation: 13
Trying to figure out the best way (ideally using what I know of grep / sed / awk) to split up an XML file based on its individual entries (keys?). I have an XML file that is a SQL dump of all my current FAQ entries, so it contains an entry ID and then a rather large HTML-formatted document. I'm looking to split these entries up so I can easily pop them into an editor and clean up the formatting to import into a new KB / FAQ system. Here's an example of my data:
<article id="3">
  <language>en</language>
  <category>Category Name</category>
  <keywords>Keywords, by, comma</keywords>
  <question>Question?</question>
  <answer>HTML Formatting</answer>
  <author>Author</author>
  <data>2010-05-13 09:32</data>
</article>
The XML file contains every single KB article I have, back to back, in this format. I'm comfortable enough with bash to figure most of it out; I just don't know how to split the file into multiple files based on a search like this.
Cheers,
Clay
Upvotes: 1
Views: 7654
Reputation: 84333
If your file is valid XML, you can use a utility like xgrep or XMLStarlet to select nodes with an XPath expression. For example, using xgrep:
xgrep -x "//article[@id]" /tmp/foo
This may be all you need. However, it won't split the articles; it just extracts the correct portions of your XML more reliably than with the use of regular expressions.
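For comparison, a rough XMLStarlet equivalent (a sketch; the sel subcommand with -t and -c is standard XMLStarlet usage, but check your version's syntax):
xmlstarlet sel -t -c "//article[@id]" /tmp/foo
Like xgrep, this copies the matching nodes to stdout rather than splitting them into files.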
If you actually need to split the articles into separate files, you can do something like this:
xgrep -x "//article[@id]" /tmp/foo.rb |
ruby -ne 'BEGIN { counter=0 }
counter += 1 if /<article/
if /<article/ ... /<\/article/
File.open("#{counter}.xml", "a") { |f| f.puts $_ }
end'
Obviously, you could do the whole thing with a Ruby XML library, but I prefer treating this sort of problem as a shell pipeline. Your mileage may vary.
Also, please note that the Ruby script above will number your articles sequentially instead of by article ID. This may be preferable if you have duplicate IDs in your XML.
Okay, okay...I just couldn't leave this one alone. It seemed like a good idea at first to use the external shell utility in a pipeline as above, but if you're going to use Perl or Ruby anyway, you might as well just use the XmlSimple library.
The Ruby script below is a little longer than the pipeline version, but gives you much more control and flexibility. Consider all the possibilities you have with this as a starting point:
#!/usr/bin/env ruby
require 'xmlsimple'

counter   = 0
node_name = 'article'

# Parse the whole dump into a hash of arrays keyed by element name.
xml = XmlSimple.xml_in '/tmp/foo'

# Write each unique article node out to its own zero-padded file.
# Keep counter numeric; format it only when building the filename.
xml[node_name].uniq.each do |node|
  counter += 1
  XmlSimple.xml_out(node,
                    RootName:   node_name,
                    OutputFile: sprintf('/tmp/%03d.xml', counter))
end
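Running this against the dump writes /tmp/001.xml, /tmp/002.xml, and so on, one file per unique article node.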
Upvotes: 6
Reputation: 2564
perl -ne 'open(F, ">", "$1.xml") if /<article id="(\d+)"/; print F $_;' file.xml
will split the XML file based on the articles' IDs. Each article section is stored in its own file, with the ID number in the name. It works really fast even on huge files (sed, awk, etc. solutions are really slow in this case).
Upvotes: 2
Reputation: 1534
Here's a simple idea for awk:
Whenever you hit a line with an article start tag, increment a counter variable by one. Then, for every line, make a system call like "echo $0 >> file$COUNTER" (or, better, use awk's own output redirection). This should be very easy to implement; a minimal sketch is below.
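A rough awk version of that idea, using awk's built-in print redirection instead of a per-line system call (untested; the article filename prefix is just a placeholder):
awk '/<article /{n++} {print > ("article" n ".xml")}' file.xml
In awk, > opens a given filename once (truncating it) and appends on later writes within the same run, so all lines of one article land in the same file. Any lines before the first start tag would end up in article.xml, since n starts out empty.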
Upvotes: 0