Haravikk

Reputation: 3290

Simplest method to convert file-size with suffix to bytes

Title says it all really, but I'm currently using a simple function with a case statement to convert human-readable file size strings into a size in bytes. It works well enough, but it's a bit unwieldy for porting into other code, so I'm curious to know if there are any widely available commands that a shell script could use instead?

Basically I want to take strings such as "100g" or "100gb" and convert them into bytes.

I'm currently doing the following:

to_bytes() {
    value=$(echo "$1" | sed 's/[^0123456789].*$//g')
    units=$(echo "$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')

    case "$units" in
        t|tb)   let 'value *= 1024 * 1024 * 1024 * 1024'    ;;
        g|gb)   let 'value *= 1024 * 1024 * 1024'   ;;
        m|mb)   let 'value *= 1024 * 1024'  ;;
        k|kb)   let 'value *= 1024' ;;
        b|'')   let 'value += 0'    ;;
        *)
                value=
                echo "Unsupported units '$units'" >&2
        ;;
    esac

    echo "$value"
}

It seems a bit overkill for something I would have thought was fairly common for scripts working with files; common enough that something might exist to do this more quickly.

If there are no widely available solutions (i.e. available on the majority of Unix and Linux flavours), then I'd still appreciate any tips for optimising the above function, as I'd like to make it smaller and easier to re-use.

Upvotes: 6

Views: 6479

Answers (6)

Alex Offshore

Reputation: 731

See man numfmt (numfmt is part of GNU coreutils).

# numfmt --from=iec 42 512K 10M 7G 3.5T
42
524288
10485760
7516192768
3848290697216

# numfmt --to=iec 42 524288 10485760 7516192768 3848290697216
42
512K
10M
7.0G
3.5T
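To connect this to the inputs in the question ("100g", "100gb"), here is a minimal sketch of a wrapper. It assumes GNU numfmt is installed, which may not be the case on non-GNU systems. The function name to_bytes and the normalisation step are assumptions for illustration: as far as I know, numfmt's iec mode expects an upper-case suffix with no trailing B, so the input is upper-cased and a final B is stripped first.

```shell
# Sketch only: wraps GNU numfmt so it accepts "100g" / "100gb" style input.
# to_bytes is a hypothetical name reused from the question.
to_bytes() {
    # Upper-case the input, then drop one trailing "B" ("100gb" -> "100G").
    normalized=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/B$//')
    # Let numfmt do the actual conversion using powers of 1024.
    numfmt --from=iec "$normalized"
}
```

Anything numfmt rejects (for example an unknown suffix) is reported by numfmt itself, with a non-zero exit status.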

Upvotes: 9

julian firminger

Reputation: 1

Another variation, adding support for decimal values with a simpler T/G/M/K parser for outputs you might find from simpler Unix programs.

to_bytes() {
    # Strip a single trailing unit letter, leaving the (possibly decimal) number.
    value=$(echo "$1" | sed -e 's/[KMGTb]$//')
    # The unit is the last character, unless that character is a digit or a dot.
    units=$(echo "$1" | grep -o '[^0-9.]$')
    case "$units" in
        T)   value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024 * 1024)")    ;;
        G)   value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024)")   ;;
        M)   value=$(bc <<< "scale=2; ($value * 1024 * 1024)")  ;;
        K)   value=$(bc <<< "scale=2; ($value * 1024)") ;;
        b|'')   :   ;;  # already in bytes, nothing to do
        *)
                value=
                echo "Unsupported units '$units'" >&2
        ;;
    esac
    echo "$value"
}

Upvotes: 0

eplictical

Reputation: 647

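toBytes() {
    # Lowercase (\L is a GNU sed extension), then chain the suffixes: t -> Xg -> XXm -> XXXk -> XXXX.
    # A trailing b is dropped and every X becomes " *1024", which $(( )) then evaluates.
    echo "$1" | echo $((`sed 's/.*/\L\0/;s/t/Xg/;s/g/Xm/;s/m/Xk/;s/k/X/;s/b//;s/X/ *1024/g'`))
}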

Upvotes: 4

Haravikk

Reputation: 3290

Okay, so it sounds like there's nothing built-in or widely available, which is a shame, so I've had a go at reducing the size of the function and come up with something that's only really 4 lines long, though it's a pretty complicated four lines!

I'm not sure if it's suitable as an answer to my original question as it's not really what I'd call the simplest method, but I want to put it up in case anyone thinks it's a useful solution, and it does have the advantage of being really short.

#!/bin/sh
to_bytes() {
    units=$(echo "$1" | sed 's/^[0123456789]*//' | tr '[:upper:]' '[:lower:]')
    index=$(echo "$units" | awk '{print index ("bkmgt kbgb  mbtb", $0)}')
    mod=$(echo "1024^(($index-1)%5)" | bc)
    [ "$mod" -gt 0 ] && 
        echo $(echo "$1" | sed 's/[^0123456789].*$//g')"*$mod" | bc
}

To quickly summarise how it works: it first strips the number from the given string and forces what remains to lowercase. It then uses awk to find the index of the suffix within a structured string of valid suffixes. The thing to note is that the string is arranged in groups of five characters (so it would need to be widened if more suffixes were added); for example, k and kb are at indices 2 and 7 respectively. The index is then reduced by one and taken modulo five, so both k and kb become 1, m and mb become 2, and so on. 1024 is then raised to that power to get the multiplier. If the suffix was invalid, awk returns an index of 0, bc computes 1024^-1, and the multiplier truncates to zero; a suffix of b (or nothing) evaluates to 1. So long as mod is greater than zero, the input string is reduced to only its numeric part and multiplied by the modifier to get the end result.
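The index arithmetic described above can be sketched in plain POSIX shell, without awk or bc. This is only an illustration of the lookup step, not a replacement for the function: suffix_power is a hypothetical helper name, and the empty suffix is handled explicitly here rather than relying on how the lookup treats an empty string.

```shell
# Sketch only: prints the power of 1024 for an already-lowercased suffix.
# suffix_power is a hypothetical helper, not part of the original answer.
suffix_power() {
    table='bkmgt kbgb  mbtb'    # same lookup string as in the function above
    # No suffix at all means plain bytes, i.e. 1024^0.
    if [ -z "$1" ]; then
        echo 0
        return 0
    fi
    # Keep everything in the table before the first occurrence of the suffix.
    prefix=${table%%"$1"*}
    # If nothing was removed, the suffix does not occur in the table.
    if [ "$prefix" = "$table" ]; then
        return 1
    fi
    index=$(( ${#prefix} + 1 ))     # 1-based position, like awk's index()
    echo $(( (index - 1) % 5 ))     # k and kb -> 1, m and mb -> 2, ...
}
```

With this helper, k and kb both give 1, g and gb give 3, b gives 0, and an unknown suffix such as x makes the function return a non-zero status instead of printing anything.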

This is actually how I would probably have solved this originally if I were using a language like PHP or Java; it's just a bit of a weird one to put together in a shell script.

I'd still very much appreciate any simplifications though!

Upvotes: 0

Kent

Reputation: 195219

I don't know if this is OK:

awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
    sub(/b$/,"")
    sub(/g/,"*"g)
    sub(/k/,"*"k)
    sub(/m/,"*"m)
    sub(/t/,"*"t)
    "echo "$0"|bc"|getline r; print r; exit;}
{print "invalid input"}'
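  • This only handles single-line input. For multi-line input, replace the exit with next; simply removing it would let the final block print "invalid input" after every valid line as well.
  • It only accepts the suffixes [kgmt] with an optional b, so kib, mib and similar would fail. It also currently only handles lower-case input.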

e.g.:

kent$  echo "200kb"|awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
    sub(/b$/,"")
    sub(/g/,"*"g)
    sub(/k/,"*"k)
    sub(/m/,"*"m)
    sub(/t/,"*"t)
    "echo "$0"|bc"|getline r
    print r; exit
}{print "invalid input"}'
204800
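The two limitations noted above (single-line input, lower-case only) could be lifted along these lines. This is a rough sketch rather than the answer's own method: it swaps the bc pipeline for awk's built-in floating-point arithmetic, so results are only exact up to about 2^53, and the function name sizes_to_bytes is hypothetical.

```shell
# Sketch only: reads one size per line on stdin, prints bytes per line.
# sizes_to_bytes is a hypothetical name; awk arithmetic replaces bc here.
sizes_to_bytes() {
    awk '
    BEGIN { mult["k"] = 1024; mult["m"] = 1024^2; mult["g"] = 1024^3; mult["t"] = 1024^4 }
    {
        s = tolower($0)                     # accept upper- and lower-case input
        if (s ~ /^[0-9.]+[kmgt]?b?$/) {
            sub(/b$/, "", s)                # drop an optional trailing b
            unit = substr(s, length(s))     # last character: unit letter or digit
            if (unit in mult)
                printf "%.0f\n", substr(s, 1, length(s) - 1) * mult[unit]
            else
                printf "%.0f\n", s          # plain byte count
            next                            # skip the "invalid input" fallback
        }
        print "invalid input"
    }'
}
```

For example, printf '200kb\n1.5M\n' | sizes_to_bytes prints 204800 and 1572864 on separate lines.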

Upvotes: 0

John Kugelman

Reputation: 361977

Here's something I wrote. It supports k, KB, and KiB. (It doesn't distinguish between powers of two and powers of ten suffixes, though, as in 1KB = 1000 bytes, 1KiB = 1024 bytes.)

#!/bin/bash

parseSize() {(
    local SUFFIXES=('' K M G T P E)  # Z and Y omitted: 1024^7 and above overflow bash's 64-bit integers
    local MULTIPLIER=1

    shopt -s nocasematch

    for SUFFIX in "${SUFFIXES[@]}"; do
        local REGEX="^([0-9]+)(${SUFFIX}i?B?)?\$"

        if [[ $1 =~ $REGEX ]]; then
            echo $((${BASH_REMATCH[1]} * MULTIPLIER))
            return 0
        fi

        ((MULTIPLIER *= 1024))
    done

    echo "$0: invalid size \`$1'" >&2
    return 1
)}

Notes:

  • Leverages bash's =~ regex operator, which stores matches in an array named BASH_REMATCH.
  • Notice the cleverly-hidden parentheses surrounding the function body. They're there to keep shopt -s nocasematch from leaking out of the function.
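
Both notes can be seen in isolation in a small bash-only sketch. The function match_ci and its regex are made up for the demonstration and are not part of the function above.

```shell
# Sketch only (bash): match_ci is a hypothetical demo function.
match_ci() {(
    # Enabled inside the subshell only, because the body is wrapped in ( ).
    shopt -s nocasematch
    re='^([0-9]+)(kb)$'
    # On a match, the capture groups land in BASH_REMATCH[1], [2], ...
    [[ $1 =~ $re ]] && echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
)}

match_ci 200KB        # prints: 200 KB
# The option never reached the calling shell:
shopt -q nocasematch || echo "nocasematch is still off in the caller"
```

Replacing the ( ) around the body with a plain { } group would leave nocasematch switched on for the rest of the script after the first call.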

Upvotes: 2
