Reputation: 19
I have a file with pipe-delimited fields, as shown below:
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598
I want to get the final file down to this:
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
I have tried the following:
sed 's/|//7'
This works in that it removes the unwanted | delimiter inside the 7th field. However, the data sometimes has more than one pipe in the 7th field, and my command only removes the first of them.
Is there a way with sed, awk, or python to remove one or more | in the 7th field so that each line ends up with exactly 8 pipes in total?
Upvotes: 1
Views: 123
Reputation: 103744
Another perl:
perl -lnE 'say join("  ",split(/(?: \| ?\|? ?)/,$_, 2))' file
Or you can use ruby if you want to treat it with a lightweight CSV parser:
ruby -r csv -lne '
BEGIN{ options={:col_sep=>"|"} }
CSV.parse($_, **options){ |r|
puts r[0..6].join("|")+" "+r[-3..-1].join("|").lstrip}
' <<< "$s"
Or sed:
sed -E 's/ \|[ |][ |]?/ /' <<< "$s"
Any of these prints:
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
Note: These replicate the two spaces between Cruise and CPT in your example. If you don't want that, remove the +" " part in the ruby and use a single space " " as the join string in the perl.
Upvotes: 1
Reputation: 626728
You can use
sed 's/|[ |]*//7'
The |[ |]* is a POSIX BRE pattern that matches:
| - a pipe char
[ |]* - zero or more spaces or pipe chars (you may also use [[:blank:]|]* to match any horizontal whitespace or pipe chars).
See the online demo:
#!/bin/bash
s='2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598'
sed 's/|[ |]*//7' <<< "$s"
Output:
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
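Since the question also mentions python, here is a rough sketch of the same idea. Python's re.sub has no "replace the Nth match" option like the 7 flag in sed, so this version anchors at the start of the line and skips six complete fields first. The names PATTERN and drop_seventh_pipe are made up for illustration, and the sample lines are copied from the question.

```python
import re

# Group 1: six "field|" chunks plus the text of the 7th field (kept as-is).
# After it: the 7th pipe and any run of spaces/pipes that follows (removed).
PATTERN = re.compile(r'^((?:[^|]*\|){6}[^|]*)\|[ |]*')

def drop_seventh_pipe(line):
    """Remove the 7th pipe and the spaces/pipes directly after it."""
    return PATTERN.sub(r'\1', line, count=1)

lines = [
    '2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | CPT||0598',
    '2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598',
    '2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise | |CPT||0598',
]

for line in lines:
    print(drop_seventh_pipe(line))
```

Like the sed command, this assumes the field that follows the stray pipes is never empty. If it were, the greedy [ |]* would also eat the delimiters you want to keep.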
If you need to match up to the seventh pipe char, then match the consecutive spaces and pipes that follow, and remove all of those pipes while keeping the whitespace, a Perl solution might be more suitable:
perl -pe 's{^(?:[^|]*\|){6}[^|]*\K\|[\s|]*}{$&=~s/\|//gr}e' file > newfile
See this online demo. What it does is:
^(?:[^|]*\|){6}[^|]*\K\|[\s|]* matches six occurrences of zero or more chars other than | followed by a | char, and then again zero or more chars other than a pipe (with ^(?:[^|]*\|){6}[^|]*). \K omits the text matched so far, and \|[\s|]* matches and consumes a pipe char and then any amount of pipe and whitespace chars.
With the e flag, the RHS (replacement) is treated as a Perl expression, and $&=~s/\|//gr means that all pipes (g means multiple occurrences) are removed from the match value.
Upvotes: 6
Reputation: 12347
Use this Perl one-liner:
perl -F'\s*\|\s*' -lane 'print join "|", @F[0..5], ( join " ", grep { /\S/ } @F[6..($#F-2)]), @F[-2, -1];' in.txt > out.txt
Output:
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or, if provided, on the regex specified in the -F option.
-F'\s*\|\s*' : Split into @F on a literal pipe, optionally surrounded by 0 or more whitespace characters.
@F[0..5] : fields 0 through 5 of the input line (the first 6 fields, field indexes are 0-based).
( join " ", grep { /\S/ } @F[6..($#F-2)]) : fields from 6 until the end, except the last 2 fields; select from these using grep only the fields with at least one non-whitespace character (\S), then join them on a space into a single string.
@F[-2, -1] : the last 2 fields of the input line.
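For comparison, the same split / filter / join steps could be sketched in Python roughly as follows. This is only an illustration under the assumptions stated in the comments, and merge_middle_fields is a made-up name.

```python
import re

def merge_middle_fields(line):
    """Keep the first 6 and last 2 fields; merge everything between them."""
    # Split on a literal pipe with optional surrounding whitespace,
    # mirroring -F'\s*\|\s*' in the one-liner.
    fields = re.split(r'\s*\|\s*', line.rstrip('\n'))
    head = fields[:6]                                 # like @F[0..5]
    # Like grep { /\S/ }: keep only fields with a non-whitespace character.
    middle = [f for f in fields[6:-2] if f.strip()]
    tail = fields[-2:]                                # like @F[-2, -1]
    return '|'.join(head + [' '.join(middle)] + tail)

sample = '2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598'
print(merge_middle_fields(sample))
```

One difference to be aware of: Perl's split drops trailing empty fields, while re.split keeps them, so the two versions may disagree on lines whose last field is empty.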
SEE ALSO:
perldoc perlrun : how to execute the Perl interpreter: command line switches
perldoc perlre : Perl regular expressions (regexes)
Upvotes: 3
Reputation:
Maybe this
awk 'BEGIN {FS="|";OFS=""} {for (i=1;i<NF;++i) if (i<7||NF-3<i) $i=$i "|"}1' file
or
sed ':a;s/|/&/9;t x;b;:x;s///7;t a' file
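As far as I can tell, the sed version loops: while the line still has a 9th pipe, it deletes the 7th one and tries again. A rough Python sketch of that loop, in case it helps to see it spelled out (remove_extra_pipes and its parameters are hypothetical names, and the defaults of 8 and 7 come from the question):

```python
def remove_extra_pipes(line, target=8, position=7):
    """Delete the pipe at `position` until only `target` pipes remain."""
    while line.count('|') > target:
        # Walk to the index of the 7th pipe (1-based `position`).
        idx = -1
        for _ in range(position):
            idx = line.index('|', idx + 1)
        # Drop that single character; surrounding spaces are left alone.
        line = line[:idx] + line[idx + 1:]
    return line

sample = '2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise || CPT||0598'
print(remove_extra_pipes(sample))
```

Unlike the regex-based answers, this only removes pipe characters, so whatever spaces sat around them stay in the merged field.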
Upvotes: 2
Reputation: 203219
$ awk 'BEGIN{FS=" *[|] *"; OFS="|"} {print $1, $2, $3, $4, $5, $6, $7 " " $(NF-2), $(NF-1), $NF}' file
2|508|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|504|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|505|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
2|506|PNP|20-dec-2015 12:32:20|3451101|0|3xPirate Ship Cruise CPT||0598
Upvotes: 3