Reputation: 1703

file cleanup using sed and regex (remove some but not all newlines)

i have a text file that i would like to load into hive. it has linebreaks within a string column so it won't load properly. from what i found out online the file needs to be preprocessed and all those linebreaks be removed. i have tried many regexes so far, but to no avail.

this is the file:

/biz/1-or-8;5.0;"a bunch of
text
with some

linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more

text
here.";2016-10-18

the desired output should be this:

/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18

i could achieve this in notepad++ by using this as a regex: (\r\n^(?!\/biz\/))+

however, when i run that regex using sed like so it doesn't work:

sed -e 's/(\r\n^(?!\/biz\/))+//g' original.csv > clean.csv

Upvotes: 2

Answers (6)

Ed Morton

Reputation: 203324

sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk. With GNU awk for multi-char RS and RT:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18

Upvotes: 0

mklement0

Reputation: 437353

As stated, sed doesn't support lookaround assertions such as (?!\/biz\/).

Since your input is essentially record-oriented, awk offers a convenient solution.

With GNU awk or Mawk (required to support multi-character input record separators):

awk -v RS='/biz/' '$1=$1 { print RS $0 }' file

RS='/biz/' splits the input into records by /biz/ (reserved variable RS is the input-record separator, \n by default).
$1=$1 looks like a no-op, but actually rebuilds the input record at hand ($0) by normalizing any record-internal runs of whitespace - including newlines - to a single space each, relying on awk's default field-splitting and output behavior.
- Additionally, since $1=$1 serves as a pattern (conditional), the outcome of the assignment decides whether the associated action ({ ... }) is executed for the record at hand.
  For an empty record - such as the implied one before the very first /biz - the assignment returns '', which in a Boolean context evaluates to false and therefore skips the associated block.
{ print RS $0 } prints the rebuilt input record, prefixed by the input record separator; print automatically appends the output record separator, ORS, which defaults to \n.

^{Note: Your code references \r\n, i.e., Windows-style CRLF line breaks. Since you're trying to use sed, I trust that the versions of the Unix utilities available to you on Windows transparently handle CRLF.

If you're actually on a Unix platform and only happen to be dealing with a Windows-originated file, a little more work is needed.}

Upvotes: 1

Sundeep

Reputation: 23667

sed doesn't support lookarounds, perl does

$ perl -0777 -pe 's/(\n^(?!\/biz\/))+//mg' original.csv 
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18

-0777 option will slurp entire file as single string
m option allows to use ^$ anchors in multiline strings

Note, line endings in Unix like systems do not use \r, but if your input does have them, use \r\n as specified used in OP.

Use different delimiter to avoid having to escape /

perl -0777 -pe 's|(\n^(?!/biz/))+||mg' original.csv

Another way to do it is delete all \n characters between a pair of double quotes

$ perl -0777 -pe 's|".*?"|$&=~s/\n//gr|gse' ip.txt 
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18

s modifier allows .* to match across multiple lines and e modifier allows to use expression instead of string in replacement
$&=~s/\n//gr allows to perform substitution on matched text ".*?"

Upvotes: 0

slitvinov

Reputation: 5768

Create files

$ cat biz.awk
{ # read entire input to a string `f' (skips newlines)
    f = f $0 
}

END {
    gsub("[^^]/biz/", "\n/biz/", f) # add a newline to all but the
                                    # first /biz/
    print f
}

and

$ cat file
/biz/1-or-8;5.0;"a bunch of
text
with some

linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more

text
here.";2016-10-18

Usage:

awk -f biz.awk file

Upvotes: 0

karakfa

Reputation: 67467

awk to the rescue! (with multi-char RS support)

$ awk -v RS='\n?^/' 'NF{$1=$1; print "/" $0}' file

$ awk -v RS='\n?^/' 'NF{$1="/"$1}NF' file

Upvotes: 0

Mustafa DOGRU

Reputation: 4112

maybe this can help you;

 sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}'

test ;

$ sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}' test
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18

Upvotes: 0

file cleanup using sed and regex (remove some but not all newlines)

Answers (6)

Related Questions