Reputation: 21
Ruby newbie here. I'm using Ruby version 1.9.2. I work at a military facility, and whenever we need to send support data to our vendors it needs to be scrubbed of identifying IP and hostname info. This is a new role for me, and the task of scrubbing files (both text and binary) now falls on me when handling support issues.
I created the following script to "scrub" plain text files of IP address info:
File.open("subnet.htm", 'r+') do |f|
  text = f.read
  text.gsub!(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, "000.000.000.000")
  f.rewind
  f.write(text)
end
I need to modify my script to search and replace hostname AND IP address information in text files AND .dat binary files. I'm looking for something really simple like my little script above, and I'd like to keep the processing of txt and dat files as separate scripts. Combining the two into one script is something I'd like to take up later as a learning exercise. Right now I'm under time constraints to scrub the support files and send them out.
The priority for me is to scrub my binary .dat trace files which are of data type XML. These are binary performance trace files from our storage arrays and they need to have the identifying IP address information scrubbed out before sending off to support for analysis.
I've searched stackoverflow.com fairly extensively and haven't found a question with an answer that addresses my specific need, and I'm simply having a hard time trying to figure out String#unpack.
Thanks.
Upvotes: 2
Views: 2040
Reputation: 303361
In general Ruby processes binary files the same as other files, with two caveats:
On Windows reading files normally translates CRLF pairs into just LF. You need to read in binary mode to ensure no conversion:
File.open('foo.bin','rb'){ ... }
In order to ensure that your binary data is not interpreted as text in some other encoding under Ruby 1.9+ you need to specify the ASCII-8BIT encoding:
File.open('foo.bin','r:ASCII-8BIT'){ ... }
However, as noted in this post, setting the 'b' flag as shown above also sets the encoding for you. Thus, just use the first code snippet above.
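As a quick way to check this on your own Ruby, here is a minimal sketch. The helper name `read_binary` and the sample bytes are made up for illustration; only `File.open` with the 'rb' mode comes from the answer above.

```ruby
require 'tempfile'

# Hypothetical helper: read a whole file in binary mode.
# 'rb' disables CRLF translation on Windows and tags the
# resulting string as ASCII-8BIT (raw bytes).
def read_binary(path)
  File.open(path, 'rb') { |f| f.read }
end

# Write a few sample bytes (including a CRLF pair and a non-ASCII byte)
# to a temporary file, then read them back.
tmp = Tempfile.new('sample')
tmp.binmode
tmp.write("host\r\n\xFF".dup.force_encoding('ASCII-8BIT'))
tmp.close

data = read_binary(tmp.path)
puts data.encoding   # the encoding Ruby assigned to the bytes
puts data.bytesize   # number of raw bytes read, CRLF included
```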
However, as noted in the comment by @ennuikiller, I suspect that you don't actually have true binary data. If you're really reading text files with a non-ASCII encoding (e.g. UTF-8), there is a small chance that treating them as binary will match only part of a multi-byte character and corrupt the resulting file.
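If the .dat files do turn out to need byte-level handling, one possible approach is a replacement that keeps every match the same length, so no other byte in the file shifts. The sketch below is only an illustration under assumptions: `scrub_ips` and `scrub_file!` are hypothetical names, the addresses are assumed to be stored as plain ASCII digits and dots, and the pattern will also zero out anything else shaped like four dotted numbers (a version string, for example).

```ruby
require 'tempfile'

# Hypothetical helper: zero the digits of anything that looks like an
# IPv4 address, leaving the dots and every other byte untouched, so the
# result has exactly the same byte length as the input.
def scrub_ips(bytes)
  bytes.gsub(/\d{1,3}(?:\.\d{1,3}){3}/) { |ip| ip.tr('0-9', '0') }
end

# Hypothetical helper: rewrite a file in place using binary mode.
def scrub_file!(path)
  File.open(path, 'r+b') do |f|
    scrubbed = scrub_ips(f.read)
    f.rewind
    f.write(scrubbed) # same length as before, so nothing is left over
  end
end

# Demo on made-up data: raw bytes surrounding an ASCII address.
tmp = Tempfile.new('trace')
tmp.binmode
tmp.write("\x00\x01host=10.20.30.40\xFF".dup.force_encoding('ASCII-8BIT'))
tmp.close

scrub_file!(tmp.path)
puts File.open(tmp.path, 'rb') { |f| f.read }.inspect
```

Hostnames could be handled the same way, for instance by replacing each one with a same-length run of a filler character, but that depends on knowing which names to look for.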
Edit: To use Nokogiri on XML files, you might do something like the following:
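require 'nokogiri'
File.open("foo.xml", 'r+') do |f|
  doc = Nokogiri.XML(f.read)
  doc.xpath('//text()').each do |text_node|
    # You cannot use gsub! here; assign the new string back to content
    text_node.content = text_node.content.gsub(/.../, '...')
  end
  f.rewind
  f.write doc.to_xml
  # Drop leftover bytes if the new XML is shorter than the original
  f.truncate(f.pos)
end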
Upvotes: 2
Reputation: 3265
I've done some binary file parsing, and this is how I read it in and cleaned it up:
data = File.open("file", 'rb') { |io| io.read }.unpack("C*").map do |val|
  val if val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
end
For me, my binary file didn't have sequential character strings, so I had to do some shifting and filtering before I could read it (hence the .map do |val| ... end block).
Unpack with the "C" tag (see http://www.ruby-doc.org/core-1.9.2/String.html#method-i-unpack) will give ASCII character codes rather than the letters, so call val.chr if you'd like to use the interpreted character instead.
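To make the round trip concrete, here is a small sketch of the same filtering idea. It swaps `map` for `select`, because `map` leaves a `nil` in the array for every rejected byte, and it uses `pack("C*")` to turn the kept byte values back into a string. The method name `printable_ascii` and the sample input are made up for illustration.

```ruby
# Hypothetical helper: keep only tab (9), line feed (10), carriage
# return (13) and printable ASCII (32..126); drop every other byte.
def printable_ascii(raw)
  kept = raw.unpack('C*').select do |val|
    val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
  end
  kept.pack('C*') # back from byte values to a string
end

sample = "A\x00B\xFF\n".dup.force_encoding('ASCII-8BIT')

p sample.unpack('C*')      # byte values, one integer per byte
p 65.chr                   # a single byte value turned into a character
p printable_ascii(sample)  # the string with non-printable bytes removed
```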
I'd suggest that you open your files in a binary editor and look through them to determine how to best handle the data parsing. If they are XML, you might consider parsing them with Nokogiri or a similar XML tool.
Upvotes: 1