user3419669

Reputation: 293

Extract lines from a file in bash

I have a file like this

I would like to extract the lines consisting of 0s and 1s (all such lines in the file) into a separate file. The sequence does not have to start with a 0; it could also start with a 1. However, such a line always comes directly after a line starting with SITE:. Moreover, I would like to extract the SITE: line itself into a separate file. Could somebody tell me how to do that in bash?

Upvotes: 1

Views: 1173

Answers (3)

mklement0

Reputation: 437062

Here's a simple awk solution that matches all lines starting with SITE: and outputs the respective next line:

awk '/^SITE:/ { if (getline) print }'  infile > outfile

Simply omit the { ... } block part to extract all lines starting with SITE: themselves to a separate file:

awk '/^SITE:/' infile > outfile

To combine both operations in a single pass, pass the names of the 2 output files (outfile1 and outfile2 here) to awk as variables f1 and f2:

awk -v f1=outfile1 -v f2=outfile2 \
  '/^SITE:/ { print > f1; if (getline) print > f2 }'  infile
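As a minimal sketch of how the combined command behaves, the following runs it against a small sample file. The sample contents (shortened bit strings) and the temporary directory held in the variable work are assumptions made for illustration, modeled on the format discussed in this thread:

```shell
# Hypothetical sample input; the real file has much longer 0/1 lines.
work=$(mktemp -d)
printf '%s\n' \
  'SITE:   0    0.000340988542    0.0357651018' \
  '0000101000' \
  'SITE:   1    0.000529755514   0.00324293642' \
  '1000000100' > "$work/infile"

# Same awk program as above: SITE: lines go to f1, the line after each to f2.
awk -v f1="$work/outfile1" -v f2="$work/outfile2" \
  '/^SITE:/ { print > f1; if (getline) print > f2 }' "$work/infile"

cat "$work/outfile1"   # the two SITE: lines
cat "$work/outfile2"   # the two 0/1 lines that follow them
```

Because getline consumes the following line, the file is read only once and each line ends up in exactly one of the two output files.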

Upvotes: 1

Idriss Neumann

Reputation: 3838

You could try something like this:

$ grep -E -o "^(0|1)+$" test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
$ grep "^SITE:" test.txt > test3.txt
$ cat test3.txt
SITE:   0    0.000340988542    0.0357651018
SITE:   1    0.000529755514   0.00324293642
SITE:   2    0.000577745511     0.052214098

Another solution, using bash:

$ while read -r; do [[ $REPLY =~ ^(0|1)+$ ]] && echo "$REPLY"; done < test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
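The loop above selects lines by content only. As a hedged sketch, a shell loop can also follow the question's rule that the 0/1 line comes directly after a SITE: line, and write both kinds of line in one pass. The sample data, the directory variable tmp, and the output names sites.txt and bits.txt are assumptions for illustration:

```shell
# Hypothetical sample input with shortened 0/1 lines.
tmp=$(mktemp -d)
printf '%s\n' \
  'SITE:   0    0.000340988542    0.0357651018' \
  '0000101000' \
  'SITE:   1    0.000529755514   0.00324293642' \
  '1000000100' > "$tmp/test.txt"

# File descriptor 3 collects the SITE: lines, descriptor 4 the lines after them.
while IFS= read -r line; do
  case $line in
    SITE:*)
      printf '%s\n' "$line" >&3
      # Read the line directly after SITE:; skip quietly at end of file.
      IFS= read -r next && printf '%s\n' "$next" >&4
      ;;
  esac
done < "$tmp/test.txt" 3> "$tmp/sites.txt" 4> "$tmp/bits.txt"

cat "$tmp/sites.txt"
cat "$tmp/bits.txt"
```

For large files a shell loop like this is typically much slower than grep or awk, so it is mainly useful when no external tools should be involved.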

To remove the leading 0 characters from each line:

$ grep -E "^(0|1)+$" test.txt | sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
1010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000
11010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000

UPDATE: for the new file format provided in the comments:

$ grep -E "^SITE:" test.txt | grep -E -o "(0|1)+$" | sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
100000000000000000000001000001000000000000000000000000000000000000
1010010010000000000111101000010000001001010111111100000000000010010001101010100011101011110011100
10000000000
$ grep -E "^SITE:" test.txt | sed "s/[01\ ]\{1,\}$//g" > test3.txt
$ cat test3.txt
SITE:   967         0.189021866    0.0169990123
SITE:   968         0.189149593     0.246619149
SITE:   969         0.189172266  6.84752689e-05

Upvotes: 1

Konrad Rudolph

Reputation: 545508

Moreover, I would like to extract the SITE: line itself into a separate file.

That’s the easy part:

grep '^SITE:' infile > outfile.site

Extracting the line after that is slightly harder:

grep --after-context=1 '^SITE:' infile \
    | grep '^[01]*$' \
    > outfile.nr

--after-context (or -A) specifies how many lines after the matching line to print as well. We then use the second grep to print only that line, and not the actually matching line (nor the delimiter which grep puts between each matching entry when specifying an after-context).

Alternatively, you could use the following to match the numeric lines:

grep '^[01]*$' infile > outfile.nr

That’s much easier, but it will find all lines consisting solely of 0s and 1s, regardless of whether they come after a line which starts with SITE:.
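To make the difference between the two approaches concrete, here is a small sketch run against a made-up file that contains one 0/1 line which does not follow a SITE: line. The file contents, the directory variable demo, and the output file names are assumptions for illustration only:

```shell
# Hypothetical input: the line 1111 is binary but does not follow SITE:.
demo=$(mktemp -d)
printf '%s\n' \
  'SITE:   0    0.1    0.2' \
  '0010' \
  'unrelated header' \
  '1111' \
  'SITE:   1    0.3    0.4' \
  '1000' > "$demo/infile"

# Context-based: only 0/1 lines that directly follow a SITE: line.
grep -A 1 '^SITE:' "$demo/infile" | grep '^[01]*$' > "$demo/after_site.nr"

# Content-based: every line made up solely of 0s and 1s.
grep '^[01]*$' "$demo/infile" > "$demo/all_binary.nr"

cat "$demo/after_site.nr"   # 0010 and 1000
cat "$demo/all_binary.nr"   # 0010, 1111 and 1000
```

Note that the pattern '^[01]*$' also matches empty lines; using '^[01][01]*$' instead would require at least one digit.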

Upvotes: 1
