Kelly Chen

Reputation: 21

Extracting blocks of X lines at an interval of Y lines

My test data:

aa1
bb1
cc1
aa2
bb2
cc2
aa3
bb3
cc3
aa4
bb4
cc4
aa5
bb5
cc5
aa6
bb6
cc6
aa7
bb7
cc7
aa8
bb8
cc8

Let's say I wish to extract lines 4-6 (aa2-cc2) into a file, then skip 6 lines and extract lines 13-15 (aa5-cc5), followed by the same skip of 6 lines. The process repeats until the end of the file. I have written a bash script that works just fine for small files.

#!/bin/bash
for i in {2..8..3}; do

sed -n "$((3*i-2))","$((3*i))"p testdata > "$i".part

done

Now I am dealing with a giant 30 GB file, and my script is hard on the hard disk because it reads the same file thousands of times. I would like to avoid wearing out the HDD by reading the file only once while extracting my parts. Is there a one-liner that can solve my problem?

I am not really a programmer, so please bear with any mixed-up terminology in my question. Thank you for your help!

Upvotes: 2

Views: 70

Answers (4)

Thor

Reputation: 47189

You could do the loop inside sed, e.g. with GNU sed:

# Skip first 3 lines, extract 3 lines and skip 6
sed -n '4~9 { N; N; p }'

Example use:

seq 40 | sed -n '4~9 { N; N; p }'

Output:

4
5
6
13
14
15
22
23
24
31
32
33

Note that this solution only prints complete blocks. If the final block does not have enough lines, it is not printed. In the example above, line 40 starts a block (40-42) that is cut short by the end of the input, so it is dropped.

Explanation

  • 4~9 tells sed to run the code block on line 4 and on every 9th line after it (4, 13, 22, ...)
  • { N; N; p } appends the next 2 lines to the pattern space (N; N) and then prints all three lines (p)
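
If the cut-off final block should be kept rather than dropped, one possible variant is to use a range that starts at each 4~9 line and runs for two more lines. This is only a sketch and assumes GNU sed, since both first~step and addr,+N are GNU extensions. The function name extract_blocks is made up for the example.

```shell
# Assumes GNU sed: first~step and addr1,+N are GNU extensions.
# extract_blocks is a hypothetical helper name used only for this example.
extract_blocks() {
    # From every line matching 4~9, print that line and the 2 lines after it.
    # A block that runs into the end of the input is printed as far as it goes.
    sed -n '4~9,+2p' "$@"
}

seq 40 | extract_blocks
```

With seq 40 as input, this should print the same blocks as above plus the lone line 40.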

Upvotes: 3

Akshay Hegde

Reputation: 16997

IIUC, you want to extract lines and write them to files. If you can create one more file that lists the ranges to extract, you may try the approach below.

Say you have a file named extract containing the ranges of interest:

$ cat extract 
4-6
13-15

This is your input file:

$ cat file
aa1
bb1
cc1
aa2
bb2
cc2
aa3
bb3
cc3
aa4
bb4
cc4
aa5
bb5
cc5
aa6
bb6
cc6
aa7
bb7
cc7
aa8
bb8
cc8

If you run:

$ awk -F'[- ]' 'FNR==NR{rules[FNR,"min"]=$1;rules[FNR,"max"]=$2;m=FNR;next}function is_in_list(i){for(i=1; i <=m; i++)if(FNR>=rules[i,"min"] && FNR<=rules[i,"max"])return rules[i,"min"]"-"rules[i,"max"]".txt"}{file=is_in_list()}file{ if(file in arr){ print >>file }else{ print >file; arr[file] } close(file) }' extract file

You get:

$ ls *.txt
13-15.txt  4-6.txt

Contents of each file are as follows:

$ cat 4-6.txt 
aa2
bb2
cc2

$ cat 13-15.txt 
aa5
bb5
cc5

If you just want to list the lines:

$ awk -F'[- ]' 'FNR==NR{rules[FNR,"min"]=$1;rules[FNR,"max"]=$2;m=FNR;next}function is_in_list(i){for(i=1; i <=m; i++)if(FNR>=rules[i,"min"] && FNR<=rules[i,"max"])return rules[i,"min"]"-"rules[i,"max"]".txt"}is_in_list()' extract file
aa2
bb2
cc2
aa5
bb5
cc5

A more readable version that writes to individual files:

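awk -F'[- ]' '
    # First file (extract): store each range as min/max
    FNR==NR{
        rules[FNR,"min"]=$1;
        rules[FNR,"max"]=$2;
        m=FNR;
        next
    }
    # Return the output file name for the current line, or nothing if no range matches
    function is_in_list(i)
    {
        for(i=1; i<=m; i++)
            if(FNR>=rules[i,"min"] && FNR<=rules[i,"max"])
                return rules[i,"min"]"-"rules[i,"max"]".txt"
    }
    {
        file=is_in_list()
    }
    # Truncate on the first write to a file, append afterwards
    file{
        if(file in arr){
            print >>file
        }
        else{
            print >file;
            arr[file]
        }
        close(file)
    }
' extract file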

A more readable version that lists the lines for the given ranges:

awk -F'[- ]' '
    FNR==NR{
        rules[FNR,"min"]=$1;
        rules[FNR,"max"]=$2;
        m=FNR;
        next
    }
    function is_in_list(i)
    {
        for(i=1; i<=m; i++)
            if(FNR>=rules[i,"min"] && FNR<=rules[i,"max"])
                return rules[i,"min"]"-"rules[i,"max"]".txt"
    }
    is_in_list()
' extract file
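
The scripts above check every range for every input line, which may get slow with thousands of ranges on a 30 GB file. A possible alternative, sketched here under the assumption that the ranges in extract are sorted in ascending order and do not overlap, is to track only the current range and stop reading once the last one is done. The names min, max, k and out are my own, and the sample files are recreated so the snippet runs on its own.

```shell
# Recreate the sample input (aa1 ... cc8) and the range list from this answer.
for i in 1 2 3 4 5 6 7 8; do
    printf 'aa%s\nbb%s\ncc%s\n' "$i" "$i" "$i"
done > file
printf '4-6\n13-15\n' > extract

# Assumption: ranges in extract are sorted ascending and do not overlap.
awk -F'-' '
    BEGIN { k = 1 }                       # index of the range currently being filled
    FNR == NR {                           # first file: load the ranges
        min[FNR] = $1 + 0
        max[FNR] = $2 + 0
        m = FNR
        next
    }
    {
        # Move past ranges that end before the current line
        while (k <= m && FNR > max[k]) {
            if (out != "") close(out)
            out = ""
            k++
        }
        if (k > m) exit                   # no ranges left, stop reading early
        if (FNR >= min[k]) {
            out = min[k] "-" max[k] ".txt"
            print > out
        }
    }
' extract file
```

On the sample data this should produce the same 4-6.txt and 13-15.txt as before, and awk stops reading shortly after line 15 instead of scanning the rest of the file.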

Upvotes: 1

randomir

Reputation: 18697

In GNU sed it's possible to use the first~step line addressing:

sed -n '4~9p; 5~9p; 6~9p' file

Upvotes: 3

glenn jackman

Reputation: 247042

A single pass through the file is all that's required. Plus a little arithmetic.

awk '{n = NR % 9} 4 <= n && n <= 6' file
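
If each block should go to its own file, as in the question's loop, the same arithmetic can be extended. This is a rough sketch rather than a tested solution for the 30 GB case: the naming formula is my own guess at reproducing the 2.part, 5.part, 8.part names from the question, and the sample file is recreated so the snippet runs on its own.

```shell
# Recreate the 24-line sample from the question as ./testdata (aa1, bb1, cc1, ..., cc8).
for i in 1 2 3 4 5 6 7 8; do
    printf 'aa%s\nbb%s\ncc%s\n' "$i" "$i" "$i"
done > testdata

# One pass: keep lines 4-6 of every 9-line cycle, one output file per block.
awk '
    { n = NR % 9 }
    4 <= n && n <= 6 {
        # Cycle number counted from 0, mapped to 2, 5, 8, ... like the loop in the question
        out = (int((NR - 1) / 9) * 3 + 2) ".part"
        print > out
    }
    n == 6 { close(out) }   # block finished, release the file descriptor
' testdata
```

Closing each file as soon as its block is complete keeps the number of open files small, which matters when the input yields thousands of parts.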

Upvotes: 3
