Reputation: 19

Replacing everything between nth occurrences in file

I have a file with pipe-delimited fields, as shown below:

2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598

I want to get the final file down to this:

2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598

I have tried the following:

sed 's/|//7'

This works when the 7th field contains a single unwanted | delimiter. However, the data sometimes has more than one pipe in the 7th field, and a single pass of my command does not remove them all.

Is there a way with sed, awk, or Python to remove one or more | from the 7th field so that each line ends up with exactly 8 pipes?

Upvotes: 1

Answers (5)

dawg

Reputation: 104102

Another perl:

perl -lnE 'say join("  ",split(/(?: \| ?\|? ?)/,$_, 2))' file

Or you can use ruby if you want to treat it with a lightweight CSV parser:

ruby -r csv -lne '
    BEGIN{ options={:col_sep=>"|"} }
    CSV.parse($_, **options){ |r| 
       puts r[0..6].join("|")+" "+r[-3..-1].join("|").lstrip}
' <<< "$s"

Or sed:

sed -E 's/ \|[ |][ |]?/  /' <<< "$s"

Any of these prints:

2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise  CPT||0598

Note:

These replicate the two spaces between Cruise and CPT in your example. If you don't want that, remove the +" " part in the ruby, and change the two-space string "  " to a single-space " " in the perl (and likewise in the sed replacement).

Upvotes: 1

Wiktor Stribiżew

Reputation: 627536

You can use

sed 's/|[ |]*//7'

The |[ |]* is a POSIX BRE pattern that matches

| - a pipe char
[ |]* - zero or more spaces or pipe chars (you may also use [[:blank:]|]* to match any horizontal whitespace or pipe chars).

See the online demo:

#!/bin/bash
s='2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598'
sed 's/|[ |]*//7' <<< "$s"

Output:

2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
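
Since the question also allows Python, here is a minimal sketch of the same idea (delete only the 7th match of a pipe followed by any run of spaces and pipes). The helper name remove_nth_match is hypothetical, chosen just for this illustration, and the sketch assumes each line is processed on its own:

```python
import re

def remove_nth_match(line: str, pattern: str, n: int) -> str:
    """Delete only the n-th (1-based) non-overlapping match of pattern in line."""
    for count, match in enumerate(re.finditer(pattern, line), start=1):
        if count == n:
            return line[:match.start()] + line[match.end():]
    return line  # fewer than n matches: leave the line unchanged

# Python's re needs the pipe escaped, unlike POSIX BRE where | is literal.
PIPE_RUN = r"\|[ |]*"

sample = "2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598"
print(remove_nth_match(sample, PIPE_RUN, 7))
```

Like the sed command, this leaves a single space between Cruise and CPT, because the space after the pipes is part of the removed match.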

If you need to find the seventh pipe char, match the run of spaces and pipes that starts there, and remove only the pipes while keeping the whitespace, a Perl solution might be more suitable:

perl -pe 's{^(?:[^|]*\|){6}[^|]*\K\|[\s|]*}{$&=~s/\|//gr}e' file > newfile

See this online demo. What it does:

^(?:[^|]*\|){6}[^|]* - matches six occurrences of zero or more chars other than | followed by a | char, and then zero or more chars other than a pipe
\K - omits the text matched so far from the match value
\|[\s|]* - matches and consumes a pipe char and then any number of pipe and whitespace chars
e flag - makes the RHS (replacement) be evaluated as a Perl expression
$&=~s/\|//gr - removes all pipes from the match value (g means all occurrences; r returns the result instead of modifying $& in place)
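
For comparison, a rough Python sketch of the same approach. Python's re has no \K, so this version assumes a capture group for the prefix that gets put back unchanged; the function name strip_pipes_keep_spaces is hypothetical:

```python
import re

# Group 1: the first six pipe-terminated fields plus the text of field 7.
# Group 2: the 7th pipe and any run of whitespace/pipes that follows it.
PATTERN = re.compile(r"^((?:[^|]*\|){6}[^|]*)(\|[\s|]*)")

def strip_pipes_keep_spaces(line: str) -> str:
    """Remove the pipes from group 2 but keep its whitespace."""
    return PATTERN.sub(
        lambda m: m.group(1) + m.group(2).replace("|", ""),
        line,
        count=1,
    )

sample = "2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598"
print(strip_pipes_keep_spaces(sample))
```

Because the whitespace survives, this yields two spaces between Cruise and CPT, as in the expected output in the question.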

Upvotes: 6

Timur Shtatland

Reputation: 12465

Use this Perl one-liner:

perl -F'\s*\|\s*' -lane 'print join "|", @F[0..5], ( join " ", grep { /\S/ } @F[6..($#F-2)]),  @F[-2, -1];' in.txt > out.txt

Output:

2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or, if provided, on the regex specified in -F option.
-F'\s*\|\s*' : Split into @F on a literal pipe, optionally surrounded by 0 or more whitespace characters.

@F[0..5] : fields 0 through 5 of the input line (the first 6 fields, field indexes are 0-based).
( join " ", grep { /\S/ } @F[6..($#F-2)] ) : fields from index 6 up to, but not including, the last 2 fields; grep keeps only those with at least one non-whitespace character (\S), and they are then joined on a space into a single string.
@F[-2, -1] : the last 2 fields of the input line.
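
The same split-and-rejoin logic can be sketched in Python. This is only an illustration under the same assumptions as the one-liner (six leading fields and two trailing fields are fixed); the function name merge_middle_fields and its head/tail parameters are hypothetical:

```python
import re

def merge_middle_fields(line: str, head: int = 6, tail: int = 2) -> str:
    """Keep the first `head` and last `tail` fields; merge everything between."""
    # Split on a literal pipe, swallowing any whitespace around it.
    fields = re.split(r"\s*\|\s*", line)
    # Drop empty/blank middle fields, then join the rest on a single space.
    middle = [f for f in fields[head:len(fields) - tail] if f.strip()]
    return "|".join(fields[:head] + [" ".join(middle)] + fields[-tail:])

sample = "2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598"
print(merge_middle_fields(sample))
```

Strip the trailing newline (for example with rstrip("\n")) before calling it on lines read from a file, which is what -l does in the Perl version.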

Upvotes: 3

user14473238

Reputation:

Maybe one of these:

awk 'BEGIN {FS="|";OFS=""} {for (i=1;i<NF;++i) if (i<7||NF-3<i) $i=$i "|"}1' file

sed ':a;s/|/&/9;t x;b;:x;s///7;t a' file
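
As far as I can tell, the sed loop reads as "while the line still has a 9th pipe, delete the 7th one". A small Python sketch of that reading, with a hypothetical function name and the counts exposed as parameters (it assumes nth is never larger than keep):

```python
def drop_extra_pipes(line: str, keep: int = 8, nth: int = 7) -> str:
    """Delete the nth pipe repeatedly until only `keep` pipes remain."""
    while line.count("|") > keep:
        # Walk forward to the position of the nth pipe.
        pos = -1
        for _ in range(nth):
            pos = line.index("|", pos + 1)
        # Cut out just that one character; surrounding spaces stay.
        line = line[:pos] + line[pos + 1:]
    return line

sample = "2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598"
print(drop_extra_pipes(sample))
```

Only pipes are removed, so the spaces around them are kept and the result has two spaces between Cruise and CPT.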

Upvotes: 2

Ed Morton

Reputation: 204638

$ awk 'BEGIN{FS=" *[|] *"; OFS="|"} {print $1, $2, $3, $4, $5, $6, $7 " " $(NF-2), $(NF-1), $NF}' file
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598

Upvotes: 3
