Reputation: 854
I have a large text file of several million lines, from which I need to extract specific lines.
Since I need to extract about 300000 lines (the line numbers to extract are read from a file), I process them in batches of x lines (say 200) to speed up the process, using the following command:
sed '1000p;1002p;2003p;...(200 times)...10001q;d' large_text_file >> extracted.txt
First I construct the string 1000p;1002p;2003p;...(200 times)...10001q;d, then I call the sed command with that string as its argument, and I repeat this until all lines are processed.
sed_lines="1000p;1002p;2003p;...(200 times)...10001q;d"
sed "$sed_lines" large_text_file >> extracted.txt
The problem I have is that these 200 lines are now treated as one single line, as sed does not seem to keep the \n at the end of each line.
Question 1: Is there an option in sed for preserving the \n at the end of each line?
Answer 1: OK, I figured this out quickly after writing this post. sed was not the culprit: I had missed the double quotes around $sentences in the line:
echo $sentences >> $forig.pseudo ==> echo "$sentences" >> $forig.pseudo
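To illustrate why the quotes matter, here is a minimal sketch with made-up sample text (the variable names unquoted and quoted are only for illustration). Without quotes, the shell splits the value on whitespace, newlines included, and echo joins the pieces with single spaces. With quotes, the value is passed through unchanged.

```shell
# Hypothetical sample: three lines captured by command substitution.
sentences=$(printf 'first line\nsecond line\nthird line\n')

# Unquoted: word splitting turns the newlines into plain separators,
# so echo prints everything on a single line.
unquoted=$(echo $sentences)

# Quoted: the embedded newlines are preserved.
quoted=$(echo "$sentences")

printf 'unquoted:\n%s\n\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```

Note that command substitution strips only the trailing newlines; the newlines between the lines are kept as long as the variable is quoted when it is used.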
Question 2: Is there a faster way to do this?
Answer 2: fedorqui's answer with awk is really fast and efficient.
For the sake of comprehension, here is the bulk of the script that does this process (edited following fedorqui's comment about the while loop):
echo "Extracting lines"
sed_lines=""
while IFS=$'\t' read -r linenr rest; do
    sed_lines+="$linenr" # Append line number
    ((cnt++)) # Batch counter
    if [ "$cnt" -eq 200 ]; then
        sed_lines+="q;d"
        sentences=$(sed "$sed_lines" "$forig") # Extract lines from file
        ((thres_cnt+=100))
        echo "$thres_cnt lines processed"
        echo "$sentences" >> "$forig.pseudo" # Write lines to new file (quoted to keep the newlines)
        sed_lines=""
        cnt=0
    else
        sed_lines+="p;"
    fi
done < "${fperp}_cut_sorted"
Upvotes: 1
Views: 1852
Reputation: 289745
What about using awk for this? First store the line numbers in an array, then, while reading the big file, check whether the current line number is in that array:
awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file
$ cat line_numbers
8
16
4
6
9
$ cat file
1 hello
2 hello
3 hello
4 hello
5 hello
6 hello
7 hello
8 hello
9 hello
10 hello
11 hello
12 hello
13 hello
14 hello
15 hello
16 hello
17 hello
18 hello
19 hello
20 hello
$ awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file
4 hello
6 hello
8 hello
9 hello
16 hello
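Since the question mentions a file of several million lines, a possible refinement is to stop reading once the largest requested line has been printed. The following is only a sketch under assumptions: the sample files are recreated in a temporary directory, and the names wanted, max and result are illustrative, not part of the one-liner above.

```shell
# Build small sample inputs in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' 8 16 4 6 9 > "$tmpdir/line_numbers"
for i in $(seq 1 20); do echo "$i hello"; done > "$tmpdir/file"

result=$(awk '
    # First file: remember each wanted line number and track the largest one.
    FNR == NR { wanted[$1]; if ($1 + 0 > max) max = $1 + 0; next }
    # Second file: print the line if its number was requested.
    FNR in wanted
    # Stop reading once the largest requested line has been reached.
    FNR >= max { exit }
' "$tmpdir/line_numbers" "$tmpdir/file")

echo "$result"
rm -rf "$tmpdir"
```

As with the one-liner, the output follows the order of the lines in the big file, not the order of the numbers in line_numbers. The early exit only pays off when the requested lines end well before the end of the file.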
Upvotes: 3