Vijay

Reputation: 985

Selecting specific lines repeatedly from a file

I have a file containing 20736 lines. Every 81 lines represent the coordinates of the atoms of one molecule, so the file holds coordinates for 256 molecules in total. I want to select the coordinates of a specific part of every molecule. For example, from each block of 81 lines I want only lines 44 to 81, for all 256 molecules.

In more detail, I want to select lines

44-81 from 1-81 lines
126-163 from 82-163 lines
208-245 from 164-245 lines
290-327 from 246-327 lines
and so on until line 20736

To achieve this, I tried a bash script like the one below:

#!/bin/bash           

while read line           
do           
echo "$line"           

done < malto-thermo-RT.set30.traj.pdbL1 

But I am not sure how to write a loop that selects only lines 44 to 81 from every subsequent block of 81 lines in the file.

I would appreciate some help.

I would also like solutions in Python, awk, and Perl, if possible, for learning purposes.

Many thanks in advance.

Upvotes: 1

Views: 195

Answers (8)

F. Hauri - Give Up GitHub

Reputation: 70822

Edited due to an error in the question.

Using the modulo operator is surely the best way. The original idea in this thread was contributed by @rici!

Unfortunately, the question is inconsistent: going from "82-163 lines" (inclusive) to "164-245 lines", I count 82 lines per block, not 81.

At first, I just wanted to offer my alternative solution.

Now corrected to match the sample ranges in the question, it may help to show where the bug is:

sed -nf <(for ((i=0;i<20736;i+=82));do echo $((i+44)),$(($i+81))p;done ) < file

Here bash generates the sed commands and sed does the job.

Step-by-step explanation

The bash portion:

for ((i=0;i<20736;i+=82)) ;do
    echo $((i+44)),$(($i+81))p
  done

prints

44,81p
126,163p
208,245p
290,327p
...
20544,20581p
20626,20663p
20708,20745p

(Note: this matches the sample in the question exactly, but it does not end at line 20736!

   echo $((20746000/82))
   253000

If each 82-line block represents a molecule, there are only 252 full molecules in 20736 lines.)

So the sed script could be written:

sed -ne '44,81p;126,163p;208,245p;290,327p;...;20626,20663p;20708,20745p' <file
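
Since the question also asks for Python, here is a minimal sketch that reproduces the address list printed by the bash loop above and checks the line-count arithmetic from the note. The function name `sed_ranges` and its parameters are made up for illustration; the 82-line period is the assumption taken from the sample ranges in the question.

```python
def sed_ranges(total_lines, period=82, first=44, last=81):
    """Return (start, end) pairs, one per block, like the bash loop prints."""
    return [(i + first, i + last) for i in range(0, total_lines, period)]

ranges = sed_ranges(20736)
print(ranges[:2])          # [(44, 81), (126, 163)]
print(ranges[-1])          # (20708, 20745)

# Whole blocks and leftover lines for each candidate block size.
print(divmod(20736, 82))   # (252, 72)
print(divmod(20736, 81))   # (256, 0)
```

With 82-line blocks the last generated range runs past line 20736, while 81-line blocks divide the file evenly into 256 molecules, which is the mismatch pointed out above.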

Upvotes: 1

rici
rici

Reputation: 241771

m % n (in many programming languages) is the "modulo" operator: the remainder left after the largest possible integer multiple of n is subtracted from m.

The lines you want to print are those for which the line number modulo 81 is at least 43. (This works out better if the first line is counted as line 0; with that adjustment, you want lines numbered 43-80, 124-161, 205-242, etc. I think the OP has a small arithmetic error, but it might be an error in the explanation. The sequence here is based on the stanzas being 81 lines, as the OP says, rather than 82 lines, as the example seems to indicate.)

So, in awk:

awk  '(NR-1)%81 >= 43' 

That's based on awk's default action, which is {print}, so I didn't have to supply one.
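
The same modulo test can be sketched in Python, which the question also asks about. This is only an illustration under the 81-line assumption; `select_lines` and its parameters are hypothetical names, and `enumerate` supplies the 0-based index that plays the role of `NR-1`.

```python
def select_lines(lines, period=81, first=44):
    """Yield the lines whose 1-based position inside each block is >= first."""
    for index, line in enumerate(lines):      # index is 0-based, like NR-1
        if index % period >= first - 1:
            yield line

# Two 81-line blocks, represented here by the line numbers 1..162.
demo = list(select_lines(range(1, 163)))
print(demo[0], demo[37], demo[38], demo[-1])  # 44 81 125 162
```

For a real file, the generator can be fed an open file object, for example `sys.stdout.writelines(select_lines(f))` inside a `with open(...) as f:` block.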

Edit: If the example ranges provided in the OP are correct (which they would be if there were a blank line separating the 81-line stanzas, for example), then this could be changed to:

awk 'NR%82>43'

Upvotes: 3

F. Hauri - Give Up GitHub

Reputation: 70822

Simple Perl using @rici's modulo idea:

perl -ne 'print if $.%82>43' file

Upvotes: 1

Technext

Reputation: 8107

Your problem statement is fine but you haven't tried hard. Check how a combination of head and tail commands & how to pass parameters to your script can help you achieve what you want.

http://www.ss64.com/bash/head.html
http://www.ss64.com/bash/tail.html

For example,

$ cat file
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10

In this example, we can print lines from 3 to 7 using:

$ head -7 file | tail -5
line3
line4
line5
line6
line7
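
For comparison, a rough Python counterpart of the `head | tail` combination is `itertools.islice`, which takes a 0-based start and an exclusive stop. The helper name `line_range` is made up for this sketch.

```python
from itertools import islice

def line_range(lines, first, last):
    """Return lines first..last (1-based, inclusive) from any iterable of lines."""
    return list(islice(lines, first - 1, last))

sample = [f"line{i}" for i in range(1, 11)]
print(line_range(sample, 3, 7))
# ['line3', 'line4', 'line5', 'line6', 'line7']
```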

Upvotes: -1

Chris Seymour

Reputation: 85795

rici has the right idea in using the modulus operator, but with 81-line blocks his solution drifts progressively out of sync with the ranges you listed, as the following demonstrates:

$ seq 350 | awk  '(NR-1)%81==43{printf "%i",$0} (NR-1)%81==80{print " -",$0}' 
44 - 81                         # In sync
125 - 162                       # Out of sync by 1 
206 - 243                       # Out of sync by 2 
287 - 324                       # Out of sync by 3 

To print the lines you requested you would do:

$ awk 'NR%82>43' file

The printed ranges are:

$ seq 350 | awk  'NR%82==44{printf "%i",$0} NR%82==81{print " -",$0}'
44 - 81
126 - 163
208 - 245
290 - 327

Test it yourself with:

$ seq 350 | awk  'NR%82>43'
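
The same comparison can be sketched in Python, collapsing the selected line numbers into ranges for each modulo test. The helper `selected_ranges` is a hypothetical name used only for this illustration.

```python
def selected_ranges(total, keep):
    """Collapse the line numbers 1..total accepted by keep() into (start, end) runs."""
    runs = []
    for n in range(1, total + 1):
        if keep(n):
            if runs and runs[-1][1] == n - 1:
                runs[-1][1] = n          # extend the current run
            else:
                runs.append([n, n])      # start a new run
    return [tuple(run) for run in runs]

# 81-line period: drifts away from the ranges listed in the question.
print(selected_ranges(350, lambda n: (n - 1) % 81 >= 43))
# [(44, 81), (125, 162), (206, 243), (287, 324)]

# 82-line period: matches the ranges listed in the question.
print(selected_ranges(350, lambda n: n % 82 > 43))
# [(44, 81), (126, 163), (208, 245), (290, 327)]
```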

Upvotes: 1

Jotne

Reputation: 41456

Using awk, you can do something like this:

awk '
    {
    if (NR>=t-37 && NR<=t)    # line is inside the wanted range of the current block
        print
    if (NR==t)                # last wanted line reached: move to the next block
        t+=82
    } ' t=81 file

Upvotes: -1

mpapec

Reputation: 50647

perl -ne '
  BEGIN{ ($f,$t)=(44,81) }
  ($.==$f .. $.==$t) =~ /(E0|.)$/ or next;
  print;
  $1 eq "E0" or next;
  $_ += 82 for $f,$t;
' file

Upvotes: 1

goji

Reputation: 7092

Here's my naive, non-idiomatic crack at it using bash:

#!/bin/bash
file=/tmp/file
segment_size=81
select_offset=44
select_size=37

start_line=$select_offset
end_line=$(($start_line + $select_size))

i=0
while IFS= read -r line    # IFS= and -r keep leading whitespace and backslashes intact
do
    i=$(($i + 1))
    if [ $i -ge $start_line ]; then

        [ $i -eq $start_line ] && [ $i != 1 ] && echo -e "\n-------------------\n"

        if [ $i -le $end_line ]; then
            echo "$line"

            if [ $i -eq $end_line ]; then
                start_line=$(($start_line + $segment_size + 1))
                end_line=$(($start_line + $select_size))
            fi
        fi
    fi
done < $file

Bash is certainly not my forte :\ :\ Seems to work tho!

Upvotes: 1
