Reputation: 41173

How to get only the first ten bytes of a binary file

I am writing a bash script that needs to get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?

Upvotes: 123

Answers (4)

F. Hauri - Give Up GitHub

Reputation: 70772

Edit Oct 8 2023: Added a Limitations chapter and How to split at line number, with a short explanation of unbuffered I/O.

How to split a stream (or a file) under bash

Reading SO request:

get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.

I understand:

How to split a file at specific point

All the other answers here access the same file twice, instead of simply splitting it...

Here is my two cents:

The interesting thing about Un*x is that every job can be considered a filter, so it is easy to split a stream using unbuffered I/O. Most standard un*x tools (cat, grep, awk, sed, python, perl ...) work as filters.

1. Using `head` or `dd` but in a single pass

{ head -c 10 >head_part; cat >tail_part;} <file

This is the most efficient approach, as the file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.

Note: only one of the redirections has to be placed inside the braces ({ ...;})! Depending on the look of your script, you could also write:

{ head -c 10 >head_part;cat;} <file >tail_part

{ head -c 10;cat >tail_part;} <file >head_part

These three syntaxes have exactly the same effect.
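As a quick check of the single-pass split, here is a minimal sketch. It assumes GNU coreutils (where head -c consumes exactly the requested bytes from a regular file); the sample content and the names under $workdir are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split a sample binary file in one pass, then verify the result.
workdir=$(mktemp -d)

# 27-byte sample: the first 10 bytes contain NUL (\0) and newline (\n) bytes.
printf 'AB\0CD\nEF\0\nrest of the data\n' >"$workdir/file"

# Form 1: both output redirections inside the braces.
{ head -c 10 >"$workdir/head_part"; cat >"$workdir/tail_part"; } <"$workdir/file"

# Form 2: the redirection for the tail moved outside the braces.
{ head -c 10 >"$workdir/head_part2"; cat; } <"$workdir/file" >"$workdir/tail_part2"

head_size=$(( $(wc -c <"$workdir/head_part") ))
tail_text=$(cat "$workdir/tail_part")

# Joining both parts must give back the original file, byte for byte.
if cat "$workdir/head_part" "$workdir/tail_part" | cmp -s - "$workdir/file"
then roundtrip=ok; else roundtrip=differs; fi

# Both forms must produce identical files.
if cmp -s "$workdir/head_part" "$workdir/head_part2" &&
   cmp -s "$workdir/tail_part" "$workdir/tail_part2"
then same_forms=yes; else same_forms=no; fi

echo "head_part: $head_size bytes, round trip: $roundtrip, same result: $same_forms"
```

cmp is used for the comparison because it works on raw bytes, so the NUL and newline bytes in the header are checked as well.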

You could do the same using `dd`:

{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part

This remains more efficient than running two dd processes that each open the same file, and the rest of the file is still copied with a standard block size.

Do not use:

dd bs=10 skip=1 of=body.part

(as suggested in other posts) on a big file! This tells dd to process the whole file in blocks of only 10 bytes. It will work, but it will be slow and consume a lot of processor resources.
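If you prefer to stay with dd for both parts, one possible sketch is to run two dd commands on the same open input: the first takes a single 10-byte block, the second copies the rest with a large block size. The file names and the 64 KiB block size are arbitrary choices for this illustration, and the input is assumed to be a regular file (on a pipe, dd may get short reads).

```shell
#!/usr/bin/env bash
# Sketch: single-pass split with dd, using a large block size for the body.
workdir=$(mktemp -d)

# Sample input: a 10-byte header followed by 99990 zero bytes.
{ printf '0123456789'; head -c 99990 /dev/zero; } >"$workdir/big_file"

{
    # One block of exactly 10 bytes: the header.
    dd bs=10 count=1 of="$workdir/head_part" 2>/dev/null
    # Everything that is left on stdin, copied in 64 KiB blocks.
    dd bs=65536 of="$workdir/tail_part" 2>/dev/null
} <"$workdir/big_file"

head_text=$(cat "$workdir/head_part")
tail_size=$(( $(wc -c <"$workdir/tail_part") ))
echo "header: $head_text, body: $tail_size bytes"
```

Both dd processes share the same file offset, so the second one starts where the first one stopped.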

Another sample, based on reading line by line:

Split an HTTP (or mail) stream at the nearly empty line (a line containing only a carriage return, \r):

openssl s_client -quiet stackoverflow.com:443 \
  <<<$'GET / HTTP/1.0\r\nHost: stackoverflow.com\r\n\r' |
    { sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

or, to drop the empty line that delimits the header:

openssl s_client -quiet stackoverflow.com:443 \
  <<<$'GET / HTTP/1.0\r\nHost: stackoverflow.com\r\n\r' |
    { sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw

This will produce two files:

ls -gh so_*.raw
-rw-r--r-- 1 user       5.4K Apr 25 11:40  so_head.raw
-rw-r--r-- 1 user        586 Apr 25 11:40  so_body.raw

grep stackoverflow so_*.raw

so_body.raw:        <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> stackoverflow.com</h2>
so_head.raw:Set-Cookie: __cf_bm=w1ulb0ysKOqrf9KCwIj_woKeVRl8xa3td7juOy0joFE-1729757464-1.0.1.1-jKf7olmIJP20nYAn3l008lZvlPWw8P8JmgcLzeZsILKIXJ9WPbcc1SRqugPdsAc8fXMfY8BpmvFqedgs62mtOQ; path=/; expires=Thu, 24-Oct-24 08:41:04 GMT; domain=.stackoverflow.com; HttpOnly; Secure

2. Pure bash way:

If the goal is to get the values of the first 10 bytes into a usable bash variable, here is a nice and efficient way:

Because ten bytes is so few, the fork to head can be avoided. From Read a file by bytes in BASH:

read8() {
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
    printf -v $_r8_var %02X "'"$_r8_car
}
{ 
    first10=()
    for i in {0..9};do
        read8 first10[i] || break
    done
    cat
 } < "$infile" >"$outfile"

This will create an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and store the rest of the data in $outfile.

declare -p first10

declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")

This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:

{
    first10=()
    for i in {0..9};do
        read8 first10[i] || break
    done
    cat
} <<<"Hello world!"
d!

As I didn't redirect the output, the string d! is printed on the terminal.

echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C

printf '%b' ${first10[@]/#/\\x}
Hello worl

3. Limitations

While this works fine with dd or head -c, it can become problematic with head -n. For example:

seq 1 4 | ( dd bs=3 count=1 2>/dev/null | sed 's/^/1:/'; sed s/^/2:/  )
1:1
1:22:
2:3
2:4

Here the output of seq is split after the third byte (just before the second newline).

seq 1 4 | ( head -c 3 | sed 's/^/1:/'; sed s/^/2:/  )
1:1
1:22:
2:3
2:4

But if you try to split the stream after the second line:

seq 1 4 | ( head -n 2 | sed 's/^/1:/'; sed s/^/2:/  )
1:1
1:2

The last part of STDIN disappears... This is due to the way head works: to find the end-of-line separators, it reads a whole buffer at once. Once that buffer has been read from the pipe, its content is no longer available in STDIN.

3.1 How to split a stream (or a file) at specific line number

To split at a line number, use sed with the -u (unbuffered) option:
instead of head -n $lineNumber, use: sed -ue ${lineNumber}q

seq 1 4 | ( sed -u 2q | sed 's/^/1:/'; sed s/^/2:/  )
1:1
1:2
2:3
2:4

As sed -u reads STDIN unbuffered, line by line, STDIN still holds the whole rest of the input when q (quit) is executed.
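The same idea can write both parts to files. This is a minimal sketch assuming GNU sed (the -u option is a GNU extension); the variable line_number and the output file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: split a stream after a given line number, in a single pass.
workdir=$(mktemp -d)
line_number=2

seq 1 5 | {
    # sed -u stops right after line $line_number without reading ahead...
    sed -u "${line_number}q" >"$workdir/first_lines"
    # ...so cat receives every remaining line.
    cat >"$workdir/remaining"
}

first_lines=$(cat "$workdir/first_lines")
remaining=$(cat "$workdir/remaining")
printf 'first part:\n%s\nsecond part:\n%s\n' "$first_lines" "$remaining"
```

Unbuffered reading costs more system calls per line, so it seems best suited to short headers followed by a large body that cat can copy at full speed.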

4. About binary

You said:

These are binary files and will likely have \0's and \n's throughout the first 10 bytes.

{
    first10=()
    for i in {0..9};do
        read8 first10[i] || break
    done
    cat
} < <(gzip <<<"Hello world!") >/dev/null 

echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03

( A sample containing a \n is at the bottom of this answer ;)

5. As a function

read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
    printf -v $_r8_var %02X "'"$_r8_car ;}
get10() {
    local -n result=${1:-first10}     # 1st arg is array name
    local -i _i
    result=()
    for ((_i=0;_i<${2:-10};_i++));do  # 2nd arg is number of bytes
        read8 result[_i] || { unset result[_i] ; return 1 ;}
    done
    cat
}

Then (here I use the special character ⛶ as a symbol for: there was no newline):

get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶

echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4

get10 test 8 <<<'Hello world'
rld

printf %b ${test[@]/#/\\x}
Hello wo⛶

get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!

( And the last character printed is a \n! ;)

Final binary demo:

get10 test 256 < <(gzip <<<'Hello world!')

printf '%b' ${test[@]/#/\\x} | gunzip 
Hello world!

printf "  %s %s %s %s  %s %s %s %s    %s %s %s %s  %s %s %s %s\n" ${test[@]}
  1F 8B 08 00  00 00 00 00    00 03 F3 48  CD C9 C9 57
  28 CF 2F CA  49 51 E4 02    00 41 E4 A9  B2 0D 00 00
  00

Note!! This works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But to split files into larger parts, you have to use split, head, tail and/or dd.
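To illustrate the file recognition idea, here is a small sketch. It uses a copy of read8 with extra quoting added; magic_of is a hypothetical helper written for this example, and its short signature list (PDF, ELF, ZIP) is only an illustration, not a complete detector.

```shell
#!/usr/bin/env bash
# Copy of read8 with quoting added: store the next byte of stdin,
# as two hexadecimal digits, in the variable named by $1.
read8() {
    local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
    read -r -d '' -n 1 _r8_car || { printf -v "$_r8_var" ''; return 1; }
    printf -v "$_r8_var" %02X "'$_r8_car"
}

# Hypothetical helper: guess the type of the data on stdin
# from its first four bytes.
magic_of() {
    local -a sig=()
    local i hex
    for i in 0 1 2 3; do
        read8 "sig[i]" || break
    done
    hex="${sig[*]}"               # e.g. "25 50 44 46"
    case $hex in
        "25 50 44 46") echo pdf ;;      # %PDF
        "7F 45 4C 46") echo elf ;;      # \x7fELF
        "50 4B 03 04") echo zip ;;      # PK\x03\x04
        *)             echo unknown ;;
    esac
}

pdf_guess=$(printf '%%PDF-1.4\n' | magic_of)
zip_guess=$(printf 'PK\003\004rest' | magic_of)
txt_guess=$(printf 'hi' | magic_of)
echo "$pdf_guess $zip_guess $txt_guess"
```

Only four bytes are read per input, so no external process is forked, whatever the size of the file.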

Upvotes: 4

psmears

Reputation: 28000

To get the first 10 bytes, as noted already:

head -c 10

To get all but the first 10 bytes (at least with GNU tail):

tail -c+11
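The +11 can look like an off-by-one error, so here is a tiny sketch of the numbering: with a leading +, tail counts from the start of the input, and byte numbers start at 1, so +11 means "begin at the 11th byte". The sample string is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the first 10 bytes are 'abcdefghij', the rest is 'KLMNOP'.
header=$(printf 'abcdefghijKLMNOP' | head -c 10)
body=$(printf 'abcdefghijKLMNOP' | tail -c +11)
echo "header=$header body=$body"
```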

Upvotes: 210

Mark Ransom

Reputation: 308140

You can use the dd command to copy an arbitrary number of bytes from a binary file.

dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
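On large files, the second command copies everything in 10-byte blocks. A possible variant, assuming GNU dd (iflag=skip_bytes is a GNU extension, so this is not portable), keeps a large block size and expresses the skip in bytes. The sample file and the 64 KiB block size are arbitrary choices for this sketch.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU dd: skip 10 bytes but copy in 64 KiB blocks.
workdir=$(mktemp -d)

# Sample input: a 10-byte header followed by 99990 zero bytes.
{ printf '0123456789'; head -c 99990 /dev/zero; } >"$workdir/infile"

# Header: one block of 10 bytes.
dd if="$workdir/infile" of="$workdir/outfile1" bs=10 count=1 2>/dev/null
# Body: with iflag=skip_bytes, skip=10 counts bytes instead of blocks.
dd if="$workdir/infile" of="$workdir/outfile2" bs=65536 iflag=skip_bytes skip=10 2>/dev/null

header=$(cat "$workdir/outfile1")
body_size=$(( $(wc -c <"$workdir/outfile2") ))
echo "header: $header, body: $body_size bytes"
```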

Upvotes: 38

moonshadow

Reputation: 89065

head -c 10 does the right thing here.

Upvotes: 60