Reputation: 41173
I am writing a bash script that needs to get the header (first 10 bytes) of a file and then, in another section, get everything except the first 10 bytes. These are binary files and will likely have \0's and \n's throughout the first 10 bytes. It seems like most utilities work with ASCII files. What is a good way to achieve this task?
Upvotes: 123
Views: 118198
Reputation: 70772
Edit Oct 8 2023: Add limitations chapter and How to split at line number with little explanation about unbuffered I/O.
Reading SO request:
get the header (first 10 bytes) of a file and then in another section get everything except the first 10 bytes.
I understand:
How to split a file at specific point
All the other answers here access the same file twice instead of just splitting it...
The interesting thing about Un*x is that every job can be considered a filter, so it is easy to split a stream using unbuffered I/O. Most standard Un*x tools (cat, grep, awk, sed, python, perl...) work as filters.
Using head or dd, but in a single pass:
{ head -c 10 >head_part; cat >tail_part;} <file
This is the most efficient way, as the file is read only once: the first 10 bytes go to head_part and the rest goes to tail_part.
Note: only one redirection has to be placed inside the braces ({ ...;})! Depending on the look of your script, you could also write:
{ head -c 10 >head_part;cat;} <file >tail_part
or
{ head -c 10;cat >tail_part;} <file >head_part
The three previous syntaxes have exactly the same effect.
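A minimal sketch to check that the single-pass split is binary-safe, assuming GNU head (which reads only the requested 10 bytes from its input). The temporary directory and the sample bytes are made up for illustration:

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)

# 10-byte header containing NUL and newline bytes, followed by a payload
printf 'AB\0\nCD\0\nEF' >"$tmp/file"
printf 'payload\0with\nbinary\0bytes\n' >>"$tmp/file"

# Single pass: head consumes the first 10 bytes, cat receives the rest
{ head -c 10 >"$tmp/head_part"; cat >"$tmp/tail_part"; } <"$tmp/file"

# Gluing both parts back together should reproduce the original file
cat "$tmp/head_part" "$tmp/tail_part" | cmp - "$tmp/file" && echo "round trip OK"
```

The cmp check is only there to show that no byte is lost or altered at the split point.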
Using dd:
{ dd count=1 bs=10 of=head_part; cat;} <file >tail_part
This remains more efficient than running two dd processes that open the same file twice.
...and cat still uses a standard block size for the rest of the file. Avoid using:
dd bs=10 skip=1 of=body.part
(as suggested in other posts) on big files! This tells dd to process the whole file in blocks of only 10 bytes. It will work, but it will be slow and consume a lot of processor resources.
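If dd is preferred for both parts, one possible sketch keeps the 10-byte block size for the header only and lets a second dd copy the rest with a larger block size. The 64k value and the random test file are arbitrary choices for illustration, not something taken from the answers here:

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)
head -c 100000 /dev/urandom >"$tmp/file"    # sample binary file (assumption)

{
  # Header: exactly one block of 10 bytes
  dd bs=10 count=1 of="$tmp/head_part" 2>/dev/null
  # Body: continues from the shared file offset, using large blocks
  dd bs=64k of="$tmp/tail_part" 2>/dev/null
} <"$tmp/file"

wc -c "$tmp/head_part" "$tmp/tail_part"
```

Both dd processes share the same open file, so the second one starts where the first one stopped.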
Split an HTTP (or mail) stream at the nearly empty line (a line containing only a carriage return, \r):
openssl s_client -quiet stackoverflow.com:443 \
<<<$'GET / HTTP/1.0\r\nHost: stackoverflow.com\r\n\r' |
{ sed -u '/^\r$/q' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
or, to drop the empty line that delimits the header:
openssl s_client -quiet stackoverflow.com:443 \
<<<$'GET / HTTP/1.0\r\nHost: stackoverflow.com\r\n\r' |
{ sed -nu '/^\r$/q;p' >/tmp/so_head.raw; cat;} >/tmp/so_body.raw
This will produce two files:
ls -gh so_*.raw
-rw-r--r-- 1 user 5.4K Apr 25 11:40 so_head.raw
-rw-r--r-- 1 user 586 Apr 25 11:40 so_body.raw
grep stackoverflow so_*.raw
so_body.raw: <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> stackoverflow.com</h2>
so_head.raw:Set-Cookie: __cf_bm=w1ulb0ysKOqrf9KCwIj_woKeVRl8xa3td7juOy0joFE-1729757464-1.0.1.1-jKf7olmIJP20nYAn3l008lZvlPWw8P8JmgcLzeZsILKIXJ9WPbcc1SRqugPdsAc8fXMfY8BpmvFqedgs62mtOQ; path=/; expires=Thu, 24-Oct-24 08:41:04 GMT; domain=.stackoverflow.com; HttpOnly; Secure
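The same split can be tried without a network connection. This is a small sketch that feeds a hand-written, made-up HTTP response through the second variant; it assumes GNU sed for the -u option and the \r escape, and the file names are illustrative:

```bash
#!/usr/bin/env bash
tmp=$(mktemp -d)

# Fake response: two header lines, the "\r\n" delimiter, then a two-line body
printf 'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nbody line 1\nbody line 2\n' |
  { sed -nu '/^\r$/q;p' >"$tmp/head.raw"; cat >"$tmp/body.raw"; }

cat "$tmp/body.raw"
```

Because sed runs unbuffered and quits on the delimiter line, cat receives only the body.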
If the goal is to get the values of the first 10 bytes into a usable bash variable, here is a nice and efficient way.
Because ten bytes are few, the fork to head can be avoided. From Read a file by bytes in BASH:
read8() {
local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car
}
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < "$infile" >"$outfile"
This creates an array ${first10[@]} containing the hexadecimal values of the first ten bytes of $infile and stores the rest of the data in $outfile.
declare -p first10
declare -a first10=([0]="25" [1]="50" [2]="44" [3]="46" [4]="2D" [5]="31" [6]="2E"
[7]="34" [8]="0A" [9]="25")
This was a PDF (%PDF -> 25 50 44 46)... Here's another sample:
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} <<<"Hello world!"
d!
As I didn't redirect the output, the string d! is printed on the terminal.
echo ${first10[@]}
48 65 6C 6C 6F 20 77 6F 72 6C
printf '%b' ${first10[@]/#/\\x}
Hello worl
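As a cross-check of these hex values, the same array can be filled with standard tools instead of the pure-bash read8. This is a different technique (it forks head and od), shown only as a sketch; the temporary file is an assumption, and od prints lowercase hex:

```bash
#!/usr/bin/env bash
infile=$(mktemp)
printf 'Hello world!\n' >"$infile"

# od -An: no offset column, -v: never squeeze repeated lines, -tx1: one hex byte per field
read -r -a first10 < <(head -c 10 "$infile" | od -An -v -tx1)

echo "${first10[@]}"    # prints: 48 65 6c 6c 6f 20 77 6f 72 6c
```

This only works on one line of od output, so it is limited to 16 bytes as written.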
While this works fine with dd or head -c, it can become problematic with head -n. For example:
seq 1 4 | ( dd bs=3 count=1 2>/dev/null | sed 's/^/1:/'; sed s/^/2:/ )
1:1
1:22:
2:3
2:4
where the output of seq is split at the third byte (just before a newline). The same happens with head -c:
seq 1 4 | ( head -c 3 | sed 's/^/1:/'; sed s/^/2:/ )
1:1
1:22:
2:3
2:4
But if you try to split file at second line:
seq 1 4 | ( head -n 2 | sed 's/^/1:/'; sed s/^/2:/ )
1:1
1:2
The last part of STDIN disappears... This is due to the way head works: it reads a whole buffer in order to find the end-of-line separators, and once that buffer has been read, its content is no longer available on STDIN.
For splitting at a line number, you should use sed with the -u (unbuffered) option:
instead of head -n $lineNumber, use: sed -ue ${lineNumber}q
seq 1 4 | ( sed -u 2q | sed 's/^/1:/'; sed s/^/2:/ )
1:1
1:2
2:3
2:4
As sed reads STDIN line by line, unbuffered, when q (quit) is executed, STDIN still holds the whole rest of the input.
You said:
These are binary files and will likely have \0's and \n's throughout the first 10 bytes.
{
first10=()
for i in {0..9};do
read8 first10[i] || break
done
cat
} < <(gzip <<<"Hello world!") >/dev/null
echo ${first10[@]}
1F 8B 08 00 00 00 00 00 00 03
(There is a sample with a \n further down ;)
read8() { local _r8_var=${1:-OUTBIN} _r8_car LANG=C IFS=
read -r -d '' -n 1 _r8_car || { printf -v $_r8_var '';return 1;}
printf -v $_r8_var %02X "'"$_r8_car ;}
get10() {
local -n result=${1:-first10} # 1st arg is array name
local -i _i
result=()
for ((_i=0;_i<${2:-10};_i++));do # 2nd arg is number of bytes
read8 result[_i] || { unset result[_i] ; return 1 ;}
done
cat
}
Then (here, I use the special character ⛶ as a symbol meaning: there was no newline):
get10 pdf 4 <$infile >$outfile
printf %b ${pdf[@]/#/\\x}
%PDF⛶
echo $(( $(stat -c %s $infile) - $(stat -c %s $outfile) ))
4
get10 test 8 <<<'Hello world'
rld
printf %b ${test[@]/#/\\x}
Hello wo⛶
get10 test 24 <<<'Hello World!'
printf %b ${test[@]/#/\\x}
Hello World!
(And the last character printed is a \n! ;)
get10 test 256 < <(gzip <<<'Hello world!')
printf '%b' ${test[@]/#/\\x} | gunzip
Hello world!
printf " %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s %s\n" ${test[@]}
1F 8B 08 00 00 00 00 00 00 03 F3 48 CD C9 C9 57
28 CF 2F CA 49 51 E4 02 00 41 E4 A9 B2 0D 00 00
00
Note!! This works fine and is very quick as long as the number of bytes to read stays low, even when processing large files. It could be used for file recognition, for example. But for splitting files into larger parts, you have to use split, head, tail and/or dd.
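A rough sketch of such file recognition, using head and od rather than read8 to keep it short. The function name file_kind is hypothetical, and only the two signatures that appear in the samples above (%PDF and the gzip magic 1F 8B) are checked:

```bash
#!/usr/bin/env bash
# Hypothetical helper: guess a file type from its first four bytes
file_kind() {
  local magic
  magic=$(head -c 4 "$1" | od -An -v -tx1 | tr -d ' \n')
  case $magic in
    25504446) echo pdf ;;       # "%PDF"
    1f8b*)    echo gzip ;;      # gzip magic number 1F 8B
    *)        echo unknown ;;
  esac
}

sample=$(mktemp)
printf 'Hello world!\n' | gzip >"$sample"
file_kind "$sample"
```

Only the first few bytes are read, so the check stays cheap even on large files.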
Upvotes: 4
Reputation: 28000
To get the first 10 bytes, as noted already:
head -c 10
To get all but the first 10 bytes (at least with GNU tail):
tail -c+11
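Note the off-by-one: tail -c +K starts output at byte K, counting from 1, so skipping N bytes means K = N + 1. A small sketch wrapping both commands (split_at is a hypothetical helper name, and the sample bytes are made up):

```bash
#!/usr/bin/env bash
# split_at N FILE HEAD_OUT BODY_OUT
split_at() {
  local n=$1 file=$2
  head -c "$n" "$file" >"$3"              # first N bytes
  tail -c "+$((n + 1))" "$file" >"$4"     # everything from byte N+1 onward
}

tmp=$(mktemp -d)
printf 'AB\0\nCD\0\nEFrest of\0the data\n' >"$tmp/infile"
split_at 10 "$tmp/infile" "$tmp/header" "$tmp/body"
wc -c "$tmp/header" "$tmp/body"
```

This reads the input file twice, which is fine for regular files but does not work on a pipe.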
Upvotes: 210
Reputation: 308140
You can use the dd command to copy an arbitrary number of bytes from a binary file.
dd if=infile of=outfile1 bs=10 count=1
dd if=infile of=outfile2 bs=10 skip=1
Upvotes: 38