Road King

Reputation: 147

Parse thousands of xml files with awk

I have several thousand files and they each contain only one very long line.

I want to convert them all into one file with one entry per line, split at the ID fields. I have this working with a few files, but it takes too long on hundreds of files and seems to crash on thousands of files. I'm looking for a faster way that has no such limit.

(find -type f -name '*.xml' -exec cat {} \;) | awk '{gsub("ID","\nID");printf"%s",$0}' 

I have also tried this:

(find -type f -name '*.xml' -exec cat {} \;) | sed 's/ID/\nID/g' 

I think the problem is that I'm using replacement instead of insertion, or that it is using too much memory.

Thanks

Upvotes: 1

Views: 687

Answers (2)

Birei

Reputation: 36272

I can't test it with thousands of files, but instead of cat-ing all the data into memory before processing it with awk, try running awk on batches of those files at a time, like:

find . -type f -name '*.xml' -exec awk '{gsub("ID","\nID"); printf "%s", $0}' {} +
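With `-exec ... {} +`, find passes the file names to awk in large batches, so there is no separate cat step and awk reads the files directly. If you also want everything in a single output file, as in the question, the same command can simply be redirected (output.txt here is only an example name):

find . -type f -name '*.xml' -exec awk '{gsub("ID","\nID"); printf "%s", $0}' {} + > output.txt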

Upvotes: 2

perreal

Reputation: 98078

  1. Create a list of all the files you need to process
  2. Divide this list into smaller sub-lists of 50 files each
  3. Create a script that reads one sub-list, applies the ID substitution, and writes an intermediate file
  4. Create another script that runs the script from step 3 on each sub-list as a background process, 20 at a time, as many times as necessary
  5. Merge the intermediate output files (a sketch of these steps follows below)
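
Purely as an illustration of that plan, here is a rough bash sketch reusing the gsub from the question. The names filelist.txt, sublist. and merged.txt are placeholders, it assumes file names without spaces, and the 50/20 sizes are the ones suggested above:

#!/bin/bash
# Step 1: list every XML file to process
find . -type f -name '*.xml' > filelist.txt

# Step 2: split the list into sub-lists of 50 file names (sublist.aa, sublist.ab, ...)
split -l 50 filelist.txt sublist.

# Steps 3 and 4: run the ID substitution on each sub-list in the background,
# pausing after every 20 jobs so at most 20 run at once
count=0
for list in sublist.*; do
  xargs awk '{gsub("ID","\nID"); printf "%s", $0}' < "$list" > "$list.out" &
  count=$((count + 1))
  if [ "$count" -ge 20 ]; then
    wait
    count=0
  fi
done
wait

# Step 5: merge the intermediate files into the final output
cat sublist.*.out > merged.txt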

Upvotes: 1
