mk97

Reputation: 274

sed to find and replace characters between two strings

I have a pipe-delimited file where some values in one of the columns contain pipes themselves, making it appear as though there are more columns than there actually are. Notice how "column 8" (bolded) has pipes in the middle. It should actually read "|col u lm n8|", with spaces in place of the pipes.

column1|column2|column3|column4|column5|column6|column7|**col|u|lm|n8**|2016|column10|column11|column12|column13|column14|

I need to replace these pipes within column8 with spaces.

The good thing is that the data in column7 and column9 (|2016) is the same across the file, so I'm able to use a sed command such as this:

sed 's/|/ /7g;s/.\(|2016\)/|\1/' 

However that will change all pipes after the 7th pipe to the end of the line. My question is how can I get it to change all pipes to spaces after the 7th pipe but up to the "|2016" column ?

Thank you

Upvotes: 3

Views: 99

Answers (7)

potong

Reputation: 58351

This might work for you (GNU sed):

sed 's/|/&\n/7;:a;ta;s/\n\(|2016|\)/\1/;s/\n|/ \n/;ta;s/\n\(.\)/\1\n/;ta' file

Insert a newline at the start of field eight, to act as a marker. If the newline appears immediately before field nine (|2016|), delete it. If the newline is followed by a |, replace the | with a space and shuffle the newline along one character. If the newline is not followed by a |, just shuffle the newline along one character.

N.B. After any successful substitution, the script loops back to the placeholder :a.
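For readers who find the marker walk hard to follow in sed, here is a rough sketch of the same idea in plain bash. It is not the answer's script: walk_field8 is a made-up name, the stop string |2016| is hard-coded, and it assumes every line has at least seven pipes.

```shell
#!/usr/bin/env bash
# walk_field8 is a hypothetical helper that mimics the newline-marker walk:
# "head" is everything left of the marker, "rest" is everything right of it.
walk_field8() {
  local line=$1 stop='|2016|' head='' rest i
  rest=$line
  # Step 1: move the first seven fields (and their pipes) into head.
  # Assumes the line really has at least seven pipes.
  for ((i = 0; i < 7; i++)); do
    head+="${rest%%|*}|"
    rest=${rest#*|}
  done
  # Step 2: shuffle one character at a time across the marker,
  # turning pipes into spaces, until the stop string comes next.
  while [[ -n $rest && $rest != "$stop"* ]]; do
    if [[ ${rest:0:1} == '|' ]]; then
      head+=' '
    else
      head+=${rest:0:1}
    fi
    rest=${rest:1}
  done
  printf '%s%s\n' "$head" "$rest"
}

walk_field8 'column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|'
```

Note that if a line has no |2016| at all, this sketch turns every pipe after the seventh into a space, so it is only meant for tracing the logic.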

Upvotes: 0

Walter A

Reputation: 19982

When the file has only one line, you could do

col8=$(sed 's/\([^|]*|\)\{7\}\(.*\)|2016.*/\2/' file)
echo "Debug line: col8=${col8}, fixed ${col8//|/ }"
sed 's/^\(\([^|]*|\)\{7\}\).*|2016/\1'"${col8//|/ }"'|2016/' file

When you know a unique character or string, you can do much the same for a file with more lines. I will use mk97 as the unique string:

Upvotes: 0

Ed Morton

Reputation: 203169

With GNU awk for the 3rd arg to match():

$ awk 'match($0,/(([^|]*[|]){7})(.*)(\|2016\|.*)/,a){gsub(/\|/," ",a[3]); $0=a[1] a[3] a[4]} 1' file
column1|column2|column3|column4|column5|column6|column7|**col u lm n8**|2016|column10|column11|column12|column13|column14|

Upvotes: 0

ghoti

Reputation: 46816

Building on what Lars provided, the following should work in all versions of sed:

sed -e ':b' -e 's/\(|column7|\)\(.*\)|\(.*|2016|\)/\1\2 \3/' -e 'tb' inputfile

This works by repeatedly replacing embedded separators until the substitute pattern can't be found. Sed's t command branches to the :b label only if the previous substitution was successful.

We use the more classic BRE both for compatibility and to avoid sed interpreting the vertical bars as "or" separators in ERE.

The sed script is separated into individual -e options because some varieties of sed require label references to be "at the end of the line", and the termination of -e's argument is considered to be equivalent to the end of the line. (GNU sed doesn't require this, but a number of other seds do.)

But as anubhava points out in comments, this is an inferior approach because it will fail if the input data includes a second 2016| somewhere to the right of column 9.
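To make that failure concrete, here is a small sketch that runs the same sed loop on two made-up lines, one with a single |2016| field and one with two. The merge_loop wrapper and the shortened sample lines are only for illustration.

```shell
#!/usr/bin/env bash
# merge_loop is a hypothetical wrapper around the sed loop shown above.
merge_loop() {
  sed -e ':b' -e 's/\(|column7|\)\(.*\)|\(.*|2016|\)/\1\2 \3/' -e 'tb'
}

# One "|2016|" on the line: only the pipes inside column 8 are replaced.
good=$(printf '%s\n' 'c1|column7|col|u|lm|n8|2016|column10|' | merge_loop)

# Two "|2016|" fields: the greedy .* reaches the last one, so the pipes
# around the first 2016 and column10 are swallowed as well.
bad=$(printf '%s\n' 'c1|column7|col|u|lm|n8|2016|column10|2016|x|' | merge_loop)

printf '%s\n%s\n' "$good" "$bad"
```

The first line comes out as c1|column7|col u lm n8|2016|column10|, while the second becomes c1|column7|col u lm n8 2016 column10|2016|x|, which is the over-merging anubhava describes.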

An alternate solution, if you're running bash, could be to place the fields into an array, then merge elements:

#!/usr/bin/env bash

input="column1|column2|column3|column4|column5|column6|column7|**col|u|lm|n8**|2016|column10|column11|column12|column13|column14|"

IFS=\| read -a a <<< "$input"

while [ "${a[8]}" != "2016" ]; do
  a[7]="${a[7]} ${a[8]}"   # merge elements
  unset a[8]               # delete merged element
  a=( "${a[@]}" )          # renumber array
done

printf "%s|" "${a[@]}"

Note that bash arrays start at index 0 by default. The readarray builtin allows you to specify an alternate start point for your index (-O), but that builtin started with bash version 4, and there's still a lot of version 3 in the wild. So for portability, read -a it is.

Note also that without further error checking, the above script goes into an endless loop if for some reason you don't have a "2016" field in your input data. :-)
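One possible way to add that error checking is sketched below. The function name merge_until and the length check are my own additions, not part of the script above: once the array has no element left beyond index 8, the marker can no longer turn up, so the function gives up instead of spinning.

```shell
#!/usr/bin/env bash
# merge_until is a hypothetical name. It merges the fields that follow
# index 7 into it until the field at index 8 equals the marker, and returns
# non-zero if the marker never shows up.
merge_until() {
  local marker=$1 line=$2
  local -a a
  IFS='|' read -r -a a <<< "$line"
  while [ "${a[8]}" != "$marker" ]; do
    if [ "${#a[@]}" -le 9 ]; then   # nothing left beyond index 8 to merge
      echo "no '$marker' field found after column 8" >&2
      return 1
    fi
    a[7]="${a[7]} ${a[8]}"   # merge elements
    unset 'a[8]'             # delete merged element (quoted to avoid globbing)
    a=( "${a[@]}" )          # renumber array
  done
  printf '%s|' "${a[@]}"
  echo
}

merge_until 2016 'c1|c2|c3|c4|c5|c6|c7|col|u|lm|n8|2016|c10|'
```

The loop always terminates because each pass either returns or shortens the array by one element.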

Upvotes: 1

anubhava

Reputation: 784898

Here is a perl solution that works even when |2016 appears again later in the line:

cat file
column1|column2|column3|column4|column5|column6|en|col|u|lm|n8|2016|column10|column11|2016|

perl -pe 's/(en\|[^|]*|(?<!^)\G[^|]*)\|(?!2016)/$1 /g' file

column1|column2|column3|column4|column5|column6|en|col u lm n8|2016|column10|column11|2016|

This regex uses the PCRE construct \G, which asserts the position at the end of the previous match, or at the start of the string for the first match.

Upvotes: 1

Haifeng Zhang

Reputation: 31885

This question really interested me. I upvoted it, but failed to solve it in sed or awk.

I tried it in Python and got it working. I am not providing an official answer, just some ideas :)

$ cat sample.csv
column1|column2|column3|column4|column5|column6|column7|col|u|lm|n8|2016|column10|column11|column12|column13|column14|

My code:

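$ cat test.py
import re

# Capture everything between "column7|" and the first "|2016" (non-greedy).
REGEX = r"column7\|(.+?)\|2016"

with open("sample.csv", "r") as inputs:
    for line in inputs:
        matches = re.findall(REGEX, line)
        column8 = matches[0]
        # The question asks for spaces in place of the embedded pipes.
        new_column8 = column8.replace("|", " ")
        # The line already ends with a newline, so don't add another.
        print(line.replace(column8, new_column8), end="")

Result:

$ python test.py
column1|column2|column3|column4|column5|column6|column7|col u lm n8|2016|column10|column11|column12|column13|column14|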

Upvotes: 0

Lars Fischer

Reputation: 10129

With your sample input this works for me with GNU sed 4.2.2:

sed -r ':start s/(column7.)([^\|]*?)\|(.*?.2016)/\1\2 \3/; t start' file

It replaces the pipes between column7. and .2016, one pipe at a time. After a successful substitution, the t command jumps back to the :start label for another substitution attempt.
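To watch the "one pipe at a time" behaviour, the loop can be unrolled in the shell so each intermediate line is printed. This is only a sketch: trace_steps is a made-up helper, the sample line is shortened, and the substitution is a BRE variant of the command above rather than the exact same regex.

```shell
#!/usr/bin/env bash
# trace_steps is a hypothetical helper. It assumes "column7|" and "|2016"
# bracket the damaged field, as in the sample data.
trace_steps() {
  local line=$1 next
  while :; do
    # A single pass replaces only the first pipe after "column7|<text>".
    next=$(printf '%s\n' "$line" |
      sed 's/\(column7|[^|]*\)|\(.*|2016\)/\1 \2/')
    [ "$next" = "$line" ] && break   # nothing changed: no pipes left to fix
    line=$next
    printf '%s\n' "$line"            # show the line after this pass
  done
}

trace_steps 'c1|column7|col|u|lm|n8|2016|c10|'
```

With this input it prints three lines, each with one more pipe turned into a space, ending with c1|column7|col u lm n8|2016|c10|. Inside sed, the t command plays the role of the shell loop.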

Upvotes: 1
