Reputation: 23
I have a data file which I need to filter with a bash script; see the data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like so: apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution, which works but is very slow: it takes 5 minutes to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}" >> output.txt
    fi
done
I'm aware that this kind of task is not very well suited to bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Upvotes: 1
Views: 4681
Reputation: 13
Why bother with IFS at all? Think JGE (Just Good Enough) and KISS (Keep It Simple, Stupid): shell variable expansion can do the job anyway ...
#!/usr/bin/env bash
# Keep shellcheck happy
declare line='' name=''
while read -r line ; do
    case "$line" in
        name=*)  name="${line#*=}" ;;
        value=*) echo "$name=${line#*=}" ;;
    esac
done < input.txt > output.txt
... and not a sub-shell in sight ;-) And FWIW, no danger of upsetting errexit either, since case doesn't update $?, whereas test does.
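A quick sketch of that difference (my example, not part of the original answer; the comment=foo line is just a placeholder that matches neither pattern):
#!/usr/bin/env bash
set -e                      # errexit
line='comment=foo'
# A case statement with no matching branch still exits with status 0,
# so errexit is not triggered:
case "$line" in
    name=*)  echo "name"  ;;
    value=*) echo "value" ;;
esac
echo "still running after the case"
# A failed test returns 1, and under errexit that aborts the script:
[[ $line == name=* ]]
echo "never reached"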
Upvotes: 0
Reputation: 362037
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES, is slow. It requires the whole file to be read into memory, even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop, it will read one line on each iteration.
while IFS= read -r x; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}" >> output.txt
    fi
done < input.txt
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection, and here's where you'll see the biggest speedup. When >> output.txt is inside the loop, output.txt must be opened and closed on every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
    if [[ -n $(echo $x | grep 'name=') ]]; then
        name=$(echo $x | sed "s/name=//")
    elif [[ -n $(echo $x | grep 'value=') ]]; then
        value=$(echo $x | sed "s/value=//")
        echo "${name}=${value}"
    fi
done < input.txt > output.txt
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string match. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens, that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter taken from the variable $IFS (internal field separator).
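To see the splitting in isolation, here is a minimal sketch of my own (the here-string is a bashism and is not part of the original answer):
IFS='=' read -r key value <<< 'name=apples'
echo "$key"    # prints: name
echo "$value"  # prints: apples
Applied inside the loop: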
while IFS='=' read -r key value; do
    case "$key" in
        name)  name="$value" ;;
        value) echo "$name=$value" ;;
    esac
done < input.txt > output.txt
Notes:
- IFS= read -r is used in the first two iterations.
- cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls.
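For illustration, here is the classic pitfall the second note alludes to (my sketch, not from the original answer): in bash's default configuration, every part of a pipeline runs in a subshell, so variables set inside a piped loop are lost when it ends.
count=0
printf 'a\nb\nc\n' | while IFS= read -r line; do
    count=$((count + 1))
done
echo "count=$count"    # prints count=0: the loop ran in a subshell

count=0
while IFS= read -r line; do
    count=$((count + 1))
done < <(printf 'a\nb\nc\n')
echo "count=$count"    # prints count=3: the loop ran in the current shell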
Upvotes: 2
Reputation: 29335
If you have performance issues, do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as character =. The first block is executed only if the first field of a line ($1) is equal to the string value. It prints variable name followed by character = and the second field ($2). The second block is executed on each line and it stores the second field ($2) in variable name.
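One note of my own, not from the original answer: the order of the two blocks matters. If the assignment block came first, a value= line would overwrite name before the print fired, pairing every value with itself:
$ awk -F= '{name = $2} $1 == "value" {print name "=" $2}' data.txt
10=10
3=3
6=6
89=89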
Normally, if your input resembles what you show, this should automatically skip the first line. Else, we can exclude it explicitly using a test on the NR variable, whose value is the line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show, but not on inputs where you would have other types of lines or several consecutive value=... lines. If you really want to test that the name/value pair is on two consecutive lines, we need something more. For instance, test if the first field is name, and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the 2 blocks in a slightly more intuitive order (but the opposite would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
Upvotes: 1
Reputation: 893
With awk there might be a more elegant solution, but you can use:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name="
says that the record separator is name=
FS="\nvalue="
says that the field separator for each record is value=
if($2)
says to only proceed the printf is the second field exists
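Two caveats of my own, not from the answer: a multi-character RS is a GNU awk extension (POSIX awk only honors a single-character RS), and the first record or two have no second field, which is exactly what the if($2) guard filters out. A quick trace under gawk:
$ printf 'name=pencils\nname=apples\nvalue=10' |
  gawk 'BEGIN{RS="\n?name=";FS="\nvalue="} {printf "record %d: $1=[%s] $2=[%s]\n", NR, $1, $2}'
record 1: $1=[] $2=[]
record 2: $1=[pencils] $2=[]
record 3: $1=[apples] $2=[10]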
Upvotes: 0
Reputation: 242168
Do not run any commands in subshells; it slows your script down a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read -r k v ; do
    if [[ $k == name ]] ; then
        name=$v
    elif [[ $k == value ]] ; then
        printf '%s=%s\n' "$name" "$v"
    fi
done
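As written, the loop reads stdin and writes stdout, so (assuming the snippet is saved as pairs.sh, a file name I've made up) you would presumably invoke it as:
bash pairs.sh < input.txt > output.txt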
Upvotes: 4