Jean Elbers

Reputation: 51

Speed up sed replacement string read from a file

First time posting, so please be kind. I am reading through the file "bar" one line at a time and using sed to replace every other line in "foo" (starting with the first line) with the line read from "bar". The code below works, but it is painfully slow when "foo" is 48,890 lines and "bar" is 24,445 lines (exactly half of foo's length).

Does anyone have a recommendation on how to speed this process up?

x=1
while read i;do
  sed -i "$x s/^.*$/$i/" foo
  x=$[$x +2]
done < bar

Upvotes: 4

Views: 1460

Answers (8)

Benjamin W.

Reputation: 52112

Interleaving with paste and awk:

paste -d '\n' bar <(awk 'NR%2==0' foo)

or, if process substitution is not available:

awk 'NR%2==0' foo | paste -d '\n' bar -

To replace foo:

paste -d '\n' bar <(awk 'NR%2==0' foo) > tmp && mv tmp foo

or

awk 'NR%2==0' foo | paste -d '\n' bar - > tmp && mv tmp foo
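
With a pair of toy files, the interleaving is easy to see (contents invented for illustration):

$ printf '%s\n' f1 f2 f3 f4 > foo
$ printf '%s\n' b1 b2 > bar
$ paste -d '\n' bar <(awk 'NR%2==0' foo)
b1
f2
b2
f4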

I benchmarked a little (just execution time, ignoring memory requirements).

Create input files (about ten times larger than in the question):

$ dd if=/dev/urandom count=500000 | tr -cd '[:alpha:]' | fold -w 100 |
> sed 's/^/foo /' > foo
$ dd if=/dev/urandom count=250000 | tr -cd '[:alpha:]' | fold -w 100 |
> sed 's/^/bar /' > bar
$ wc -l foo bar
  539994 foo
  270126 bar
  810120 total

I used time to measure execution time. All solutions had their output redirected to a new file. Results in seconds, averaged over five tries each:

codeforester            9.878
codeforester, mapfile   8.072
Fred                   17.332
Charles Duffy          "Argument list too long"
Claude                 27.448
Barmar                  0.298
Benjamin W.             0.176

Charles's solution also blew up with input at 10% of the size used here.
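
The "Argument list too long" failure is the kernel's ARG_MAX limit on the total size of arguments passed to a single exec'd command: that approach hands sed two argv entries per line of bar (the -e and the expression carrying the full line text). The limit on a given system can be inspected with:

$ getconf ARG_MAX   # maximum bytes for argv + environment, system-dependent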

Upvotes: 6

Fred

Reputation: 6995

This is a heavily modified version of my first answer, which I am posting separately following the benchmarks submitted.

#!/bin/bash
exec 3< foo
exec 4< bar
IFS=
n=$'\n'
while :
do
   # two lines from foo, one replacement line from bar
   readarray -t -n 2 -u 3 fl && read -r -u 4 bl || break
   # print the bar line in place of the first foo line, then the kept foo line
   echo "$bl$n${fl[1]}"
done
# Add remaining data
if [[ -n $bl ]]; then
   # bar's last line had no trailing newline; it still replaces fl[0]
   echo "$bl$n${fl[1]}"
elif (( ${#fl[@]} > 0 )); then
   # bar ran out first: keep the foo lines already read
   printf '%s\n' "${fl[@]}"
fi
# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3
# Close file descriptors
exec 3<&-
exec 4<&-

Turns out my "hand optimized" solution is simpler and more readable than my first version, which goes to show that thinking about speed sometimes brings about simplification, which is always good.

On my machine, my first answer runs in about the same time as in the benchmark above; this new one finishes in a bit less than 7 seconds, which is considerably quicker, but still nothing as fast as the awk solutions, of course.

EDIT

I replaced the two reads from "foo" with a single readarray, which reduced the run time (on my machine) from about 9 seconds to below 7, more than I would have expected. This leads me to think significant improvements could be made by reading both files into arrays in chunks (not whole files, to avoid the risk of hitting memory limits), at the cost of additional code complexity, obviously.
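
A minimal sketch of that chunked idea, assuming bash 4+ (the chunk size and variable names are illustrative choices, not from the answer above):

#!/bin/bash
exec 3< foo
exec 4< bar

chunk=1024
# read bar in chunks; stop when a chunk comes back empty
while readarray -t -n "$chunk" -u 4 bl; (( ${#bl[@]} > 0 ))
do
    # two foo lines for every bar line in this chunk
    readarray -t -n $(( 2 * ${#bl[@]} )) -u 3 fl
    for (( k = 0; k < ${#bl[@]}; k++ ))
    do
        printf '%s\n' "${bl[k]}"         # replacement line from bar
        printf '%s\n' "${fl[2 * k + 1]}" # kept (even) line from foo
    done
done

cat <&3   # leftover foo lines, if any
exec 3<&- 4<&-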

Upvotes: 0

NeronLeVelu

Reputation: 10039

awk seems the best alternative because it doesn't fork a subshell for every line it reads; it handles both files in a single process, with little extra code. Note that FNR (the record number within the current file) is used for the parity test, since NR keeps counting across both files.

# Oneliner for batch or command line
awk 'FNR==NR{b[NR]=$0;next}{if(FNR%2==1)$0=b[(FNR+1)/2];print}' bar foo

The same code, self-commented for understanding:

awk '# when reading the first file (bar)
     FNR == NR {
        # load the line content into an array
        bar[NR] = $0
        # cycle to the next line (don't go further in the code for this input line)
        next
        }

     # every line from the other files (only foo here)
     {
        # on every odd line, replace the content with the corresponding array entry;
        # FNR is the record number within foo and is odd here,
        # so (FNR + 1) / 2 is half the line number, rounded up
        if (FNR % 2 == 1) $0 = bar[(FNR + 1) / 2]

        # print the line (modified or not)
        print
     }
    ' bar foo

Upvotes: 1

codeforester

Reputation: 42999

I think the slowness of your current solution is caused by the huge number of forks needed for sed, as well as the heavy I/O from repeatedly rewriting your file. Here is a pure Bash solution with zero forks:

#!/bin/bash

# read "bar" file into an array - this should take less memory than "foo"
while read -r line; do
  bar_array+=("$line")
done < bar


# traverse "foo" file and replace odd lines with the lines from "bar"
# we don't need to read the whole file into memory
i=0
max_bar="${#bar_array[@]}"
while read -r line; do
  #
  # we look at bar_array only when we are within the limits of that file
  #
  p="$line"
  if ((i % 2 == 0 && i / 2 < max_bar)); then
    p=${bar_array[i / 2]}   # bar line (i/2 + 1) replaces foo line (i + 1)
  fi
  printf "%s\n" "$p"
  ((i++))
done < foo

Example run:

bar's content:

11
22
33
44
55

foo's content:

1
2
3
4
5
6
7
8

Output:

11
2
22
4
33
6
44
8

With Bash 4 and above, the read statement

while read -r line; do
  bar_array+=("$line")
done < bar

can also be written as:

mapfile -t bar_array < bar

Upvotes: 2

Fred

Reputation: 6995

Other answers suggest approaches based on storing whole files in arrays. At some point this runs into practical limits, depending on file size.

One other way is to simply read from both files, one line at a time, opening them in separate file descriptors.

#!/bin/bash

exec 3< foo
exec 4< bar

eof_bar=0
eof_foo=0

while [[ $eof_bar = 0 ]]
do
   # Foo line we skip (it is the one being replaced)
   IFS= read -r -u 3 foo_line || eof_foo=$?
   [[ "$eof_foo" = 0 ]] || [[ -n "$foo_line" ]] || break
   # Bar line we replace it with
   IFS= read -r -u 4 bar_line || eof_bar=$?
   [[ "$eof_bar" = 0 ]] || [[ -n "$bar_line" ]] || break
   # Actual replacement (both files had the required lines)
   printf "%s\n" "$bar_line"
   # Foo line we keep
   IFS= read -r -u 3 foo_line || eof_foo=$?
   [[ "$eof_foo" = 0 ]] || [[ -n "$foo_line" ]] || break
   printf "%s\n" "$foo_line"
done

# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3

# Close file descriptors
exec 3>&-
exec 4>&-

The code reads two lines from foo for each line from bar: the first foo line is read but not printed (the line from bar replaces it), and the second is printed unchanged.

Doing it this way will use very little memory, so files of arbitrary size can be handled.

Upvotes: 1

Claude

Reputation: 1014

Here's a streaming solution that runs in small constant memory, in case you have really huge files on a machine with little RAM.

#!/bin/bash

# duplicate lines in bar to standard output
paste -d '\n' bar bar |

# pair line-by-line foo with lines from previous command
paste -d '|' foo - |

# now the stream is like:
#  foo line 1|bar line 1
#  foo line 2|bar line 1
#  foo line 3|bar line 2
#  foo line 4|bar line 2
#  foo line 5|bar line 3
#  ...
{
  # set field separator to correspond with previous paste delimiter
  IFS='|'
  # read pairs of lines: keep the bar half of the first line (it replaces
  # the odd foo line) and the foo half of the second line
  while read -r junk bar && read -r foo junk
  do
    # the replacement line taken from bar
    printf "%s\n" "$bar"
    # the even-numbered line from foo, kept as-is
    printf "%s\n" "$foo"
  done
}

You have to choose a delimiter (here |) that doesn't occur in foo. Tested with:

paste (GNU coreutils) 8.26
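
A cheap guard before committing to a delimiter (my addition, not part of the original answer):

# abort early if the chosen delimiter already occurs in foo
if grep -q '|' foo; then
    echo "delimiter '|' occurs in foo, pick another one" >&2
    exit 1
fi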

Upvotes: 0

Charles Duffy

Reputation: 295315

Run all your sed commands in one invocation, and you rewrite foo only once, instead of rewriting it once per line of bar.

x=1
sed_exprs=( )
while IFS= read -r i; do
  sed_exprs+=( -e "$x s/^.*$/$i/" )
  x=$(( x + 2 ))
done < bar

sed "${sed_exprs[@]}" -i foo

Upvotes: 0

Barmar

Reputation: 780798

Here's an awk solution. It reads all of bar into an array. Then, while reading foo, it prints either the next element of that array or the line itself, depending on whether the line number within foo (FNR) is odd or even.

awk 'BEGIN {index1 = 1}
     FNR == NR {file1[NR] = $0; next}
     FNR % 2 == 1 { print file1[index1++]; next }
     { print }' bar foo > newfoo
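
Run against the small foo and bar shown in codeforester's answer, this prints the expected interleaving:

$ awk 'BEGIN {index1 = 1}
>      FNR == NR {file1[NR] = $0; next}
>      FNR % 2 == 1 { print file1[index1++]; next }
>      { print }' bar foo
11
2
22
4
33
6
44
8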

Upvotes: 4
