Juicce

Reputation: 33

Faster way to extract data from large file

I have a file containing about 40,000 frames of Cartesian coordinates for 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.

I tried a bash script with a for loop:

for i in {0..39999}
do
    cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done

The data have the following form:

28
-1373.82296 frame 0   xyz file generated by terachem
  Re       1.6345663991    0.9571586961    0.3920887712
   N       0.7107677071   -1.0248027788    0.5007181135
   N      -0.3626961076    1.1948218124   -0.4621264246
   C      -1.1299268126    0.0792071086   -0.5595954110
   C      -0.5157993503   -1.1509115191   -0.0469223696
   C       1.3354467762   -2.1017253883    1.0125736017
   C       0.7611763218   -3.3742177216    0.9821756556
   C      -1.1378354025   -2.4089069492   -0.1199253156
   C      -0.4944655989   -3.5108477831    0.4043826684
   C      -0.8597552614    2.3604180994   -0.9043060625
   C      -2.1340008843    2.4846545826   -1.4451933224
   C      -2.4023114639    0.1449111237   -1.0888703147
   C      -2.9292779079    1.3528434658   -1.5302429615
   H       2.3226814021   -1.9233467458    1.4602019023
   H       1.3128699342   -4.2076373780    1.3768411246
   H      -2.1105470176   -2.5059031902   -0.5582958817
   H      -0.9564415355   -4.4988963635    0.3544299401
   H      -0.1913951275    3.2219343258   -0.8231465989
   H      -2.4436044324    3.4620639189   -1.7693069306
   H      -3.0306593902   -0.7362803011   -1.1626515622
   H      -3.9523215784    1.4136948699   -1.9142814745
   C       3.3621999538    0.4972227756    1.1031860016
   O       4.3763020637    0.2022266109    1.5735343064
   C       2.2906331057    2.7428149541    0.0483795630
   O       2.6669163864    3.8206298898   -0.1683800650
   C       1.0351398442    1.4995168190    2.1137684156
   O       0.6510904387    1.8559680025    3.1601927094
  Cl       2.2433490373    0.2064711824   -1.9226174036

It works, but it takes an enormous amount of time, and in the future I will be working with larger files. Is there a faster way to do this?

Upvotes: 0

Views: 312

Answers (3)

Gem Taylor

Reputation: 5613

You could perhaps use 2 passes of grep, rather than thousands?

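Assuming you want the last lines of the 27 that follow every frame header, and you don't need to record the frame number itself, the following pipeline should get the lines you want, which you can then tidy with awk. With GNU grep, the first pass prints a separator line containing only -- between the groups it outputs, and the second pass keeps the 6 lines before each separator (atoms 22 to 27, the same lines the tail -n 6 in the question selects). The pattern is passed with -e because it starts with a dash. The last frame is not followed by a separator, so it needs separate handling.

grep -A27 ' frame ' input | grep -B6 -e '^--$'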

If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.

Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers, here 0 to 39999 (-E because life is too short for backslashes):

-E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '
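A range pattern of this kind is easy to get subtly wrong, so it may be worth checking it against a few hand-written header lines first. The sketch below assumes a pattern with balanced parentheses that accepts frame numbers 0 to 39999; the helper name frames_in_range and the header lines are made up for illustration.

```shell
# Hypothetical helper: keep only header lines whose frame number is 0..39999.
frames_in_range() {
    grep -E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '
}

# Made-up header lines in the same layout as the sample data.
# Only the first three should pass the filter.
printf '%s\n' \
    '-1.0 frame 0   xyz file' \
    '-1.0 frame 9999   xyz file' \
    '-1.0 frame 39999   xyz file' \
    '-1.0 frame 40000   xyz file' \
    '-1.0 frame 123456   xyz file' | frames_in_range
```

The trailing space in the pattern matters: without it, "frame 40000" would match on its first four digits.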

Upvotes: 0

kvantour

Reputation: 26481

The reason your program is slow is that the for loop re-reads the whole input file on every iteration. You can do everything in a single pass over the file with awk instead:

awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output 

This answer assumes the following form of data:

frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...

The solution checks whether a line contains the word frame. If so, it resets the atom counter c to zero and skips to the next line. From that point on, it increments the counter for every line it reads. If the counter is strictly between 20 and 27 (that is, 21 through 26), it prints the coordinates; change the test to c<=27 to include atom 27 as well.

You can now easily extend this. Suppose you want the same atoms, but only from frame 1000 to 1500. You can do this by introducing a frame counter fc, which counts frame headers starting from 1:

awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output 
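If the selection should follow the number printed in each header rather than a running count, the number can be read from the header line itself. The following is a minimal sketch, assuming the header layout shown in the question (energy, the word frame, then the frame number) and atom lines with exactly four fields. The function name extract_atoms and its argument order are hypothetical.

```shell
# Hypothetical helper: extract_atoms FIRST_FRAME LAST_FRAME FIRST_ATOM LAST_ATOM FILE
# Assumes header lines look like "<energy> frame <number> ..." and
# atom lines look like "<element> x y z".
extract_atoms() {
    awk -v f0="$1" -v f1="$2" -v a0="$3" -v a1="$4" '
        $2 == "frame" { frame = $3 + 0; atom = 0; next }  # header: store its label, restart the atom count
        NF != 4       { next }                            # skip atom-count lines such as "28"
                      { atom++ }                          # count atom lines within the current frame
        frame >= f0 && frame <= f1 && atom >= a0 && atom <= a1 { print $2, $3, $4 }
    ' "$5"
}

# Example call (file names are placeholders):
#   extract_atoms 1000 1500 21 27 input > output
```

Both ranges are inclusive here, so the call in the comment would select atoms 21 through 27 of the frames labelled 1000 through 1500, still in one pass over the file.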

Upvotes: 3

gmargari

Reputation: 171

If the frame numbers in the file are already in sorted order, e.g. 0 to 39999, then something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested). The word frame is the second field of the header line, so that is the field to test, and the -- separator lines that grep -A prints between groups are skipped:

cat "$1" | grep -A 27 -E "frame [0-9]+ " | \
awk '{if ($2 == "frame") n = 0; if ($0 != "--" && n++ > 20) print $2, $3, $4}' > new_coors.xyz

(The code above is deliberately verbose, to be easier to understand and closer to your existing script. If you need a more compact solution, see kvantour's answer.)
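Since no sample file accompanies the question, a small synthetic trajectory is enough to check any of these pipelines before running them on real data. The sketch below is only a stand-in: make_frames is a hypothetical helper that writes frames in the layout shown in the question, with the atom index stored in the x column and the frame index in the y column so the output is easy to inspect.

```shell
# Hypothetical helper: make_frames NFRAMES NATOMS
# Writes a fake trajectory to stdout: atom-count line, header line, atom lines.
make_frames() {
    awk -v frames="$1" -v atoms="$2" 'BEGIN {
        for (f = 0; f < frames; f++) {
            print atoms
            printf "-1.0 frame %d   xyz file generated by test\n", f
            for (a = 1; a <= atoms; a++)
                printf "   X  %d.0  %d.0  0.0\n", a, f   # x = atom index, y = frame index
        }
    }'
}

# Three frames of 28 atoms; select atoms 21..27 in a single pass and count the lines.
make_frames 3 28 |
    awk '$2 == "frame" { c = 0; next } { c++ } c >= 21 && c <= 27 { print $2, $3, $4 }' |
    wc -l
```

The count should be 21, that is 3 frames times 7 atoms. Dropping the final wc -l shows the selected lines themselves, whose first column should run from 21.0 to 27.0 within each frame.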

Upvotes: 1
