amwalker

Reputation: 355

How to remove multiple, repeating ranges of lines from a csv file?

I am working with a CSV file that is the output of a gas chromatograph data analyzer, so I can only manipulate what is provided. I need to remove the unnecessary lines from the file, or equivalently keep only the necessary ones. The actual file has 960 lines.

The first 8 lines of the file look like this:

[Line 1]  Remove
[Line 2]  Remove
[Line 3]  Keep
[Line 4]  Remove
[Line 5]  Remove
[Line 6]  Remove
[Line 7]  Keep
[Line 8]  Keep

The pattern of lines to keep and remove repeats for hundreds of lines; here are the next 8 lines as an example:

[Line 9]   Remove
[Line 10]  Remove
[Line 11]  Keep
[Line 12]  Remove
[Line 13]  Remove
[Line 14]  Remove
[Line 15]  Keep
[Line 16]  Keep

No string pattern distinguishes these lines; only the line numbers do. I would like to avoid calculating the ranges for hundreds of lines and listing them all in sed, as in the script below, which handles only the first 8 lines.

    sed '1,2d; 4,6d' test.csv >> cut_test.csv

I am hoping for the following:

[Line 3]  Keep
[Line 7]  Keep
[Line 8]  Keep
[Line 11] Keep
[Line 15] Keep
[Line 16] Keep

Upvotes: 2

Views: 434

Answers (5)

potong

Reputation: 58558

This might work for you (GNU sed):

sed -n 'n;n;p;n;n;n;n;p;n;p' file

Does as it says on the tin.

Better (already mentioned by Thor):

sed -n '3~8p;7~8,+1p' file

Upvotes: 1

Allan

Reputation: 12456

If the lines to keep follow exactly the pattern you described (repeating every 8 lines), you can use the following GNU sed command:

$ sed '1~8d;2~8d;4~8d;5~8d;6~8d;' input.csv 
[Line 3]  Keep
[Line 7]  Keep
[Line 8]  Keep
[Line 11]  Keep
[Line 15]  Keep
[Line 16]  Keep

Redirect the output to a new file, or use -i.back to change the file in place (GNU sed then keeps a backup copy with the .back suffix).

Explanation:

  • 1~8d will execute the d command on the 1st line, 9th line,...
  • 2~8d will execute the d command on the 2nd line, 10th line,...
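
To make the first~step address more concrete, here is a small Python sketch that lists which line numbers such an address selects. It is only a model of the addressing rule, assuming that first~step matches lines first, first+step, first+2*step and so on; the helper name step_matches is made up for this illustration and is not part of sed.

```python
def step_matches(first, step, last_line):
    """1-based line numbers that a `first~step` address would select, up to last_line."""
    return [n for n in range(1, last_line + 1)
            if n >= first and (n - first) % step == 0]

# Model the five delete addresses 1~8, 2~8, 4~8, 5~8 and 6~8 on a 16-line file.
deleted = set()
for first in (1, 2, 4, 5, 6):
    deleted.update(step_matches(first, 8, 16))

kept = [n for n in range(1, 17) if n not in deleted]
print(kept)  # [3, 7, 8, 11, 15, 16]
```

The surviving line numbers are the ones marked Keep in the question.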

input.csv:

$ cat input.csv 
[Line 1]  Remove
[Line 2]  Remove
[Line 3]  Keep
[Line 4]  Remove
[Line 5]  Remove
[Line 6]  Remove
[Line 7]  Keep
[Line 8]  Keep
[Line 9]   Remove
[Line 10]  Remove
[Line 11]  Keep
[Line 12]  Remove
[Line 13]  Remove
[Line 14]  Remove
[Line 15]  Keep
[Line 16]  Keep

You can even simplify the command by regrouping everything in the following way (that is close to your command):

$ sed '1~8,2~8d;4~8,6~8d;' input.csv 
[Line 3]  Keep
[Line 7]  Keep
[Line 8]  Keep
[Line 11]  Keep
[Line 15]  Keep
[Line 16]  Keep

As mentioned by Thor you can reduce the command if, instead of deleting the lines you want to remove, you just print the lines you want to keep:

$ sed -n '3~8p;7~8,8~8p;' input.csv
[Line 3]  Keep
[Line 7]  Keep
[Line 8]  Keep
[Line 11]  Keep
[Line 15]  Keep
[Line 16]  Keep

Upvotes: 5

Amanda Ellaway

Reputation: 143

The sed solution is elegant, but as you also tagged Python, here's an equivalent solution in that language. It should scale to enormous files if it ever becomes necessary, because it never reads the entire file at once (which I believe is true of the sed solution too):

import itertools

mask = [False, False, True, False, False, False, True, True]  # one 8-line period
with open('input.csv', 'r') as in_file, open('output.csv', 'w') as out_file:
    # Iterate over the file object (not readlines()) so lines are read lazily.
    out_file.writelines(line for line, keep in zip(in_file, itertools.cycle(mask)) if keep)

Upvotes: 1

Walter A

Reputation: 20032

Short answer:

awk prints the line by default when the pattern matches (NR%8 is 0 for every 8th line, hence the 0): awk 'NR%8~/3|7|0/' input.csv

Long answer, inspired by the comments of @kvantour

awk 'NR%8~/3|7|0/' input.csv
# or shorter (when modulo < 10)
awk 'NR%8~/[037]/' input.csv

When you need a modulus greater than 9, you have to match the complete remainder with the ^ and $ anchors, and group the alternatives in parentheses so that the anchors apply to all of them. With modulo 25 and lines 3,7,8,11,14,22 you can use

awk 'NR%25~/^(3|7|8|11|14|22)$/' input.csv
# or shorter
awk 'NR%25~/^([378]|1[14]|22)$/' input.csv
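
To see why the anchors have to cover the whole alternation, here is a minimal Python sketch. It uses re.search, which, like awk's ~ operator, matches anywhere in the string; for patterns this simple Python's regex syntax should behave the same as awk's, but treat that as an assumption. The names wanted, ungrouped, grouped and remainders_matching are illustrative only.

```python
import re

wanted = {3, 7, 8, 11, 14, 22}

# Without parentheses, ^ binds only to the first alternative and $ only to the last.
ungrouped = re.compile(r"^3|7|8|11|14|22$")
# With parentheses, both anchors apply to every alternative.
grouped = re.compile(r"^(3|7|8|11|14|22)$")

def remainders_matching(pattern):
    """Remainders modulo 25 whose decimal text is matched by the pattern."""
    return {r for r in range(25) if pattern.search(str(r))}

print(sorted(remainders_matching(grouped)))             # [3, 7, 8, 11, 14, 22]
print(sorted(remainders_matching(ungrouped) - wanted))  # [17, 18]
```

The ungrouped pattern also accepts 17 and 18, because a bare 7 or 8 matches anywhere inside the remainder.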

This becomes harder to read for more values. An alternative is

# Original case
awk 'BEGIN {a[3];a[7];a[0]} NR%8 in a' input.csv 
# 3,7,8,11,14,22
awk 'BEGIN {a[3];a[7];a[8];a[11];a[14];a[22];} NR%25 in a' input.csv 

Pulling the numbers out:

# Original case
awk 'FNR==NR {a[$0];next} FNR%8 in a' <(printf "%s\n" 3 7 0) input.csv 
# 3,7,8,11,14,22
awk 'FNR==NR {a[$0];next} FNR%25 in a' <(printf "%s\n" 3 7 8 11 14 22) input.csv 
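
The same idea, with the period and the remainders to keep passed in as data, can be sketched in Python. This is an illustrative equivalent rather than a translation of the awk internals; the function name select_lines and its parameters are hypothetical.

```python
def select_lines(lines, period, keep_remainders):
    """Yield each line whose 1-based line number modulo period is in keep_remainders."""
    keep = set(keep_remainders)
    for number, line in enumerate(lines, start=1):
        if number % period in keep:
            yield line

# Original case: keep lines 3, 7 and 8 of every 8 (line 8 has remainder 0).
sample = [f"[Line {n}]\n" for n in range(1, 17)]
print("".join(select_lines(sample, 8, [3, 7, 0])), end="")
```

Because the function is a generator, it can also be given an open file object and will then read one line at a time.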

Upvotes: 1

Davis Herring

Reputation: 40033

The Python approach is just

import sys
for i, line in enumerate(sys.stdin):
    if i % 8 in (2, 6, 7):  # 0-based index: lines 3, 7 and 8 of each block
        print(line, end='')  # line already ends with a newline

Upvotes: 3
