Floran Gmehlin

Reputation: 854

Extract multiple lines from large text file with sed while preserving each trailing newline (Bash Script)

I have a large text file of several million lines, from which I need to extract specific lines.

Since I need to extract about 300000 lines (the line numbers to extract are read from a file), I process them in batches of x lines (say 200) to speed things up, using the following command:

sed '1000p;1002p;2003p;...(200 times)...10001q;d' large_text_file >> extracted.txt

First I construct the string 1000p;1002p;2003p;...(200 times)...10001q;d, then I call sed with that string as its argument, and repeat until all lines are processed.

 sed_lines="1000p;1002p;2003p;...(200 times)...10001q;d"
 sed "$sed_lines" large_text_file >> extracted.txt
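As a rough sketch of that construction step, the batch string can also be produced by sed itself from a file of sorted line numbers. The file contents and the names `line_numbers` and `extracted` below are made up for illustration and are not from the original script; this assumes one number per line, in ascending order.

```shell
# Hypothetical demo data: a 20-line file and three sorted line numbers
large_text_file=$(mktemp)
line_numbers=$(mktemp)
seq 1 20 | sed 's/$/ hello/' > "$large_text_file"
printf '%s\n' 4 6 9 > "$line_numbers"

# Append "p;" to every number except the last, "q;d" to the last,
# then join the pieces into a single sed script: 4p;6p;9q;d
sed_lines=$(sed -e '$!s/$/p;/' -e '$s/$/q;d/' "$line_numbers" | tr -d '\n')

# Run the generated script; q prints the last wanted line, then stops reading
extracted=$(sed "$sed_lines" "$large_text_file")
printf '%s\n' "$extracted"

rm -f "$large_text_file" "$line_numbers"
```

The trailing `q` matters for speed: sed stops at the last requested line of the batch instead of scanning the rest of the file.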

The problem I have is that these 200 lines are now treated as one single line, as sed does not seem to keep the \n at the end of each line.

Question 1: Is there an option in sed for preserving the \n at the end of each line?

Answer 1: OK, I figured this out quickly after writing this post. sed does keep the newlines; the shell was word-splitting the unquoted variable, so echo joined everything with spaces. Basically, I had left out the double quotes around $sentences in this line:

echo $sentences >> $forig.pseudo ==> echo "$sentences" >> $forig.pseudo
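A minimal sketch of the difference, assuming bash with the default IFS. The two-line value is invented sample data standing in for the sed output.

```shell
# Hypothetical two-line value standing in for the output of sed
sentences=$(printf 'first line\nsecond line\n')

# Unquoted: the shell splits the value on whitespace (newlines included)
# and echo joins the resulting words with single spaces
unquoted=$(echo $sentences)

# Quoted: the value reaches echo as one argument, newlines intact
quoted=$(echo "$sentences")

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```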

Question 2: Is there a faster way to do this?

Answer 2: fedorqui's answer with awk is really fast and efficient.

For the sake of completeness, here is the bulk of the script that does this (edited following fedorqui's comment about the while loop):

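echo "Extracting lines"
sed_lines=""
cnt=0
thres_cnt=0
while IFS=$'\t' read -r linenr rest; do
        sed_lines+="$linenr"                   # Append line number
        ((cnt++))                              # Batch counter
        if [ "$cnt" -eq 200 ]; then
                sed_lines+="q;d"               # Quit at the batch's last line (q prints it first)
                sentences=$(sed "$sed_lines" "$forig")   # Extract lines from file
                ((thres_cnt+=200))
                echo "$thres_cnt lines processed"
                echo "$sentences" >> "$forig.pseudo"     # Quoted, so newlines are kept
                sed_lines=""
                cnt=0
        else
                sed_lines+="p;"
        fi
done < "${fperp}_cut_sorted"
# Flush the final partial batch, if any (sed_lines then ends in "p;")
if [ -n "$sed_lines" ]; then
        sed "${sed_lines}d" "$forig" >> "$forig.pseudo"
fi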

Upvotes: 1

Views: 1852

Answers (1)

fedorqui

Reputation: 289745

What about using awk for this? First store the line numbers in an array, then check whether each line number of the file is in that array:

awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file

Sample

$ cat line_numbers
8
16
4
6
9
$ cat file
1 hello
2 hello
3 hello
4 hello
5 hello
6 hello
7 hello
8 hello
9 hello
10 hello
11 hello
12 hello
13 hello
14 hello
15 hello
16 hello
17 hello
18 hello
19 hello
20 hello
$ awk 'FNR==NR{line[$0]=$0; next} FNR in line' line_numbers file 
4 hello
6 hello
8 hello
9 hello
16 hello
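If the wanted lines all sit near the top of a very large file, a possible variant is to stop reading once the highest requested line has been printed. This is only a sketch, not part of the answer above: the array name `want` and the `max` variable are introduced here for illustration, and it assumes line_numbers is non-empty and holds one plain integer per line.

```shell
# Recreate the sample data from above in temporary files
file=$(mktemp)
line_numbers=$(mktemp)
seq 1 20 | sed 's/$/ hello/' > "$file"
printf '%s\n' 8 16 4 6 9 > "$line_numbers"

result=$(awk '
    # First file: remember each wanted line number and track the largest
    FNR == NR { want[$0]; if ($0 + 0 > max) max = $0 + 0; next }
    # Second file: print the line if its number was requested
    FNR in want
    # Nothing left to find past the largest number, so stop reading
    FNR >= max { exit }
' "$line_numbers" "$file")
printf '%s\n' "$result"

rm -f "$file" "$line_numbers"
```

Like the one-liner, this prints the lines in file order regardless of how line_numbers is sorted.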

Upvotes: 3
