Reputation: 51
First time posting, so please be kind. I am reading through the file "bar" one line at a time and using sed to replace every other line of "foo" (starting with the first line) with the line read from "bar". The code below works, but it is painfully slow: "foo" is 48,890 lines and "bar" is 24,445 lines (exactly half of foo's length).
Does anyone have a recommendation on how to speed this process up?
x=1
while read i; do
    sed -i "$x s/^.*$/$i/" foo
    x=$[$x +2]
done < bar
Upvotes: 4
Views: 1460
Reputation: 52112
Interleaving with paste and awk:
paste -d '\n' bar <(awk 'NR%2==0' foo)
or, if process substitution is not available:
awk 'NR%2==0' foo | paste -d '\n' bar -
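As a quick sanity check with made-up two- and four-line files (my own demo, not part of the answer), the interleaving produces bar's lines followed by foo's even lines:

```shell
#!/bin/bash
# Throwaway demo data; names and contents are made up for illustration.
dir=$(mktemp -d)
printf '%s\n' b1 b2 > "$dir/bar"
printf '%s\n' f1 f2 f3 f4 > "$dir/foo"

# even lines of foo, interleaved after each line of bar
out=$(awk 'NR%2==0' "$dir/foo" | paste -d '\n' "$dir/bar" -)
printf '%s\n' "$out"
```

The output is b1, f2, b2, f4 — every odd line of foo replaced by the next line of bar.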
To replace foo:
paste -d '\n' bar <(awk 'NR%2==0' foo) > tmp && mv tmp foo
or
awk 'NR%2==0' foo | paste -d '\n' bar - > tmp && mv tmp foo
I benchmarked a little (just execution time, ignoring memory requirements).
Create input files (about ten times larger than in the question):
$ dd if=/dev/urandom count=500000 | tr -cd '[:alpha:]' | fold -w 100 |
> sed 's/^/foo /' > foo
$ dd if=/dev/urandom count=250000 | tr -cd '[:alpha:]' | fold -w 100 |
> sed 's/^/bar /' > bar
$ wc -l foo bar
539994 foo
270126 bar
810120 total
I used time to measure execution time. All solutions had their output redirected to a new file. Results in seconds, averaged over five tries each:
codeforester 9.878
codeforester, mapfile 8.072
Fred 17.332
Charles Duffy "Argument list too long"
Claude 27.448
Barmar 0.298
Benjamin W. 0.176
Charles's solution also blew up with input at 10% of the size used here.
Upvotes: 6
Reputation: 6995
This is a heavily modified version of my first answer, which I am posting separately in light of the benchmarks that were submitted.
#!/bin/bash
exec 3< foo
exec 4< bar
IFS=
n=$'\n'
while :
do
    # two lines from foo, one from bar; stop when bar runs out
    readarray -n 2 -u 3 fl && read -r -u 4 bl || break
    # the bar line replaces the first foo line of the pair; ${fl[1]}
    # keeps its trailing newline (no -t), so none is added for it
    echo -n "$bl$n${fl[1]}"
done
# Add remaining data from the last, incomplete iteration
if [[ -n $bl ]]; then
    # an unterminated final bar line still replaces the first foo line
    echo "$bl"
else
    # bar ended cleanly; keep the already-read foo line unchanged
    echo -n "${fl[0]}"
fi
echo -n "${fl[1]}"
# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3
# Close file descriptors
exec 3>&-
exec 4>&-
Turns out my "hand optimized" solution is simpler and more readable than my first version, which goes to show that thinking about speed sometimes brings about simplification, which is always good.
On my machine, the test for my first answer runs in about the same time as for the benchmark, and with this new answer in a bit less than 7 seconds, which is considerably quicker, but nothing as fast as the awk solution, of course.
EDIT
I replaced the two reads from "foo" with one readarray, which reduced run time (on my machine) from about 9 seconds to below 7, more than I would have expected. This leads me to think significant improvements could be made by reading both files into arrays (though not whole files, to avoid the risk of hitting memory limits), at the cost of additional code complexity, obviously.
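A rough sketch of that chunked idea (my illustration only, with an arbitrary chunk size, assuming as in the question that foo is exactly twice as long as bar):

```shell
#!/bin/bash
# Sketch: read bar and foo in fixed-size chunks with readarray and
# interleave in memory. CHUNK is an arbitrary tuning knob.
interleave_chunks() {
    local CHUNK=2
    exec 3< "$1" 4< "$2"    # $1 = foo, $2 = bar
    local fchunk bchunk k
    while readarray -t -n "$CHUNK" -u 4 bchunk; ((${#bchunk[@]})); do
        # pull twice as many foo lines as we got bar lines
        readarray -t -n "$((2 * ${#bchunk[@]}))" -u 3 fchunk
        for ((k = 0; k < ${#bchunk[@]}; k++)); do
            # bar line replaces the odd foo line; even foo line is kept
            printf '%s\n' "${bchunk[k]}" "${fchunk[2 * k + 1]}"
        done
    done
    exec 3>&- 4>&-
}

# throwaway demo data
dir=$(mktemp -d)
printf '%s\n' f1 f2 f3 f4 f5 f6 > "$dir/foo"
printf '%s\n' b1 b2 b3 > "$dir/bar"
out=$(interleave_chunks "$dir/foo" "$dir/bar")
printf '%s\n' "$out"
```

On the demo data this prints b1, f2, b2, f4, b3, f6, the same interleaving as above, but only CHUNK bar lines (and twice as many foo lines) are held in memory at a time.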
Upvotes: 0
Reputation: 10039
awk seems like the best alternative here because it doesn't fork a subshell for every line it reads; it processes both files in a single invocation, with only a little extra complexity.
# Oneliner for batch or command line
awk 'FNR==NR{b[FNR]=$0;next}{if(FNR%2==1)$0=b[(FNR+1)/2];print}' bar foo
Same code but self commented for understanding
awk '# when reading the first file (bar)
FNR == NR {
    # load the line into an array, indexed by line number
    bar[FNR] = $0
    # cycle to next line (do not run the rest of the code for this line)
    next
}
# every line from the other files (only foo here)
{
    # on odd lines, replace the content with the matching array entry;
    # FNR is the line number within foo, so (FNR + 1) / 2 maps
    # lines 1, 3, 5, ... to bar entries 1, 2, 3, ...
    if (FNR % 2 == 1) $0 = bar[(FNR + 1) / 2]
    # print the line (modified or not)
    print
}
' bar foo
Upvotes: 1
Reputation: 42999
I think the slowness in your current solution is caused by the huge number of forks needed for sed, as well as the heavy I/O from repeatedly rewriting your file. Here is a pure Bash solution with zero forks:
#!/bin/bash
# read "bar" file into an array - this should take less memory than "foo"
while read -r line; do
    bar_array+=("$line")
done < bar
# traverse "foo" file and replace odd lines with the lines from "bar"
# we don't need to read the whole file into memory
i=0
max_bar=${#bar_array[@]}
while read -r line; do
    #
    # replace odd lines (zero-based even i) while bar lines remain;
    # the line at index i takes bar line i/2
    #
    p=$line
    if (( i % 2 == 0 && i / 2 < max_bar )); then
        p=${bar_array[i / 2]}
    fi
    printf "%s\n" "$p"
    ((i++))
done < foo
Example run:
bar's content:
11
22
33
44
55
foo's content:
1
2
3
4
5
6
7
8
Output:
11
2
22
4
33
6
44
8
With Bash 4 and above, the read statement
while read -r line; do
    bar_array+=("$line")
done < bar
can also be written as:
mapfile -t bar_array < bar
Upvotes: 2
Reputation: 6995
Other answers suggest approaches based on storing whole files in arrays, which will hit practical limits at some point, depending on file size.
Another way is to simply read from both files, one line at a time, opening them on separate file descriptors.
#!/bin/bash
exec 3< foo
exec 4< bar
eof_bar=0
eof_foo=0
while [[ $eof_bar = 0 ]]
do
    # Foo line we will replace (read and discard)
    IFS= read -r -u 3 foo_line || eof_foo=$?
    [[ $eof_foo = 0 ]] || [[ -n $foo_line ]] || break
    # Bar line we replace it with; if bar has run out,
    # keep the foo line unchanged and stop
    IFS= read -r -u 4 bar_line || eof_bar=$?
    [[ $eof_bar = 0 ]] || [[ -n $bar_line ]] || { printf "%s\n" "$foo_line"; break; }
    # Actual replacement (both files had the required lines)
    printf "%s\n" "$bar_line"
    # Foo line we keep
    IFS= read -r -u 3 foo_line || eof_foo=$?
    [[ $eof_foo = 0 ]] || [[ -n $foo_line ]] || break
    printf "%s\n" "$foo_line"
done
# Cat the rest of the lines from foo (if any), if bar did not
# have enough lines compared to foo
cat <&3
# Close file descriptors
exec 3>&-
exec 4>&-
The code reads two lines from foo for each line from bar: the first of the two is replaced by the line from bar, and the second is printed unchanged.
Doing it this way will use very little memory, so files of arbitrary size can be handled.
Upvotes: 1
Reputation: 1014
Here's a streaming solution that runs in small constant memory, in case you have really huge files on a machine with little RAM.
#!/bin/bash
# duplicate lines in bar to standard output
paste -d '\n' bar bar |
# pair line-by-line foo with lines from previous command
paste -d '|' foo - |
# now the stream is like:
# foo line 1|bar line 1
# foo line 2|bar line 1
# foo line 3|bar line 2
# foo line 4|bar line 2
# foo line 5|bar line 3
# ...
{
    # set field separator to correspond with previous paste delimiter
    IFS='|'
    # read stream lines in pairs: the first supplies the bar line that
    # replaces the odd foo line, the second the even foo line we keep
    while read -r foo bar && read -r foo2 junk
    do
        # print the bar line in place of the odd line from foo
        printf "%s\n" "$bar"
        # then the even line from foo, unchanged
        printf "%s\n" "$foo2"
    done
}
You have to choose a delimiter (here |) that doesn't occur in foo. Tested with:
paste (GNU coreutils) 8.26
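A hypothetical pre-check (my addition, not part of the answer) could verify the delimiter assumption before running the pipeline:

```shell
#!/bin/bash
# Hypothetical guard: refuse to run if the chosen delimiter
# appears anywhere in the given file.
check_delim() {
    local delim=$1 file=$2
    if grep -qF -- "$delim" "$file"; then
        echo "delimiter '$delim' occurs in $file; pick another" >&2
        return 1
    fi
}

# demo with a throwaway file that does contain the delimiter
tmp=$(mktemp)
printf '%s\n' 'plain line' 'line with a | pipe' > "$tmp"
check_delim '|' "$tmp" && verdict=ok || verdict=conflict
echo "$verdict"
```

Here the demo prints "conflict" because the second line contains the delimiter; on a clean file the function returns success and the pipeline can proceed.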
Upvotes: 0
Reputation: 295315
Run all your sed commands in one invocation, and you rewrite foo only once, instead of rewriting it once per line of bar.
x=1
sed_exprs=( )
while IFS= read -r i; do
    sed_exprs+=( -e "$x s/^.*$/$i/" )
    x=$(( x + 2 ))
done < bar
sed "${sed_exprs[@]}" -i foo
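To make the construction concrete, here is what the array holds for a hypothetical two-line bar (the input data is made up):

```shell
#!/bin/bash
# Illustration only: build the sed expression array for a
# made-up two-line bar (11, 22) and print its elements.
x=1
sed_exprs=( )
while IFS= read -r i; do
    sed_exprs+=( -e "$x s/^.*$/$i/" )
    x=$(( x + 2 ))
done < <(printf '%s\n' 11 22)

printf '%s\n' "${sed_exprs[@]}"
```

This prints -e, 1 s/^.*$/11/, -e, 3 s/^.*$/22/ on separate lines. Note that a bar line containing /, &, or \ would break the generated expression, since $i is interpolated into the sed program unescaped.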
Upvotes: 0
Reputation: 780798
Here's an awk solution. It reads all of bar into an array. When it's reading foo, it prints either the next element of this array or the line itself, depending on whether the line number is odd or even.
awk 'BEGIN {index1 = 1}
     FNR == NR {file1[NR] = $0; next}
     FNR % 2 == 1 { print file1[index1++]; next }
     { print }' bar foo > newfoo
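A quick way to see the behavior on small made-up inputs (demo data of my own, not from the question):

```shell
#!/bin/bash
# Demo: bar supplies sequential replacements for foo's odd lines.
dir=$(mktemp -d)
printf '%s\n' 11 22 33 44 55 > "$dir/bar"
printf '%s\n' 1 2 3 4 5 6 7 8 > "$dir/foo"

out=$(awk 'BEGIN {index1 = 1}
     FNR == NR {file1[NR] = $0; next}
     FNR % 2 == 1 { print file1[index1++]; next }
     { print }' "$dir/bar" "$dir/foo")
printf '%s\n' "$out"
```

Lines 1, 3, 5, 7 of foo come out as 11, 22, 33, 44, while the even lines 2, 4, 6, 8 pass through unchanged (extra bar lines, like 55 here, are simply unused).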
Upvotes: 4