Reputation: 1703
i have a text file that i would like to load into hive. it has linebreaks within a string column so it won't load properly. from what i found out online the file needs to be preprocessed and all those linebreaks be removed. i have tried many regexes so far, but to no avail.
this is the file:
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
the desired output should be this:
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
i could achieve this in notepad++ by using this as a regex: (\r\n^(?!\/biz\/))+
however, when i run that regex using sed like so it doesn't work:
sed -e 's/(\r\n^(?!\/biz\/))+//g' original.csv > clean.csv
Upvotes: 2
Views: 104
Reputation: 203324
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk. With GNU awk for multi-char RS and RT:
$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
Upvotes: 0
Reputation: 437353
As stated, sed
doesn't support lookaround assertions such as (?!\/biz\/)
.
Since your input is essentially record-oriented, awk
offers a convenient solution.
With GNU awk
or Mawk (required to support multi-character input record separators):
awk -v RS='/biz/' '$1=$1 { print RS $0 }' file
RS='/biz/'
splits the input into records by /biz/
(reserved variable RS
is the input-record separator, \n
by default).
$1=$1
looks like a no-op, but actually rebuilds the input record at hand ($0
) by normalizing any record-internal runs of whitespace - including newlines - to a single space each, relying on awk
's default field-splitting and output behavior.
$1=$1
serves as a pattern (conditional), the outcome of the assignment decides whether the associated action ({ ... }
) is executed for the record at hand./biz
- the assignment returns ''
, which in a Boolean context evaluates to false and therefore skips the associated block.{ print RS $0 }
prints the rebuilt input record, prefixed by the input record separator; print
automatically appends the output record separator, ORS
, which defaults to \n
.
Note: Your code references \r\n
, i.e., Windows-style CRLF line breaks. Since you're trying to use sed
, I trust that the versions of the Unix utilities available to you on Windows transparently handle CRLF.
If you're actually on a Unix platform and only happen to be dealing with a Windows-originated file, a little more work is needed.
Upvotes: 1
Reputation: 23667
sed
doesn't support lookarounds, perl
does
$ perl -0777 -pe 's/(\n^(?!\/biz\/))+//mg' original.csv
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
-0777
option will slurp entire file as single stringm
option allows to use ^$
anchors in multiline stringsNote, line endings in Unix like systems do not use \r
, but if your input does have them, use \r\n
as specified used in OP.
Use different delimiter to avoid having to escape /
perl -0777 -pe 's|(\n^(?!/biz/))+||mg' original.csv
Another way to do it is delete all \n
characters between a pair of double quotes
$ perl -0777 -pe 's|".*?"|$&=~s/\n//gr|gse' ip.txt
/biz/1-or-8;5.0;"a bunch oftextwith somelinebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"moretexthere.";2016-10-18
s
modifier allows .*
to match across multiple lines and e
modifier allows to use expression instead of string in replacement$&=~s/\n//gr
allows to perform substitution on matched text ".*?"
Upvotes: 0
Reputation: 5768
Create files
$ cat biz.awk
{ # read entire input to a string `f' (skips newlines)
f = f $0
}
END {
gsub("[^^]/biz/", "\n/biz/", f) # add a newline to all but the
# first /biz/
print f
}
and
$ cat file
/biz/1-or-8;5.0;"a bunch of
text
with some
linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more
text
here.";2016-10-18
Usage:
awk -f biz.awk file
Upvotes: 0
Reputation: 67467
awk
to the rescue! (with multi-char RS support)
$ awk -v RS='\n?^/' 'NF{$1=$1; print "/" $0}' file
or
$ awk -v RS='\n?^/' 'NF{$1="/"$1}NF' file
Upvotes: 0
Reputation: 4112
maybe this can help you;
sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}'
test ;
$ sed -n '/^\s*$/d;$!{ 1{x;d}; H}; ${ H;x;s|\n\([^\/biz]\)| \1|g;p}' test
/biz/1-or-8;5.0;"a bunch of text with some linebreaks in between.";2016-11-03
/biz/1-or-8;2.0;"more text here.";2016-10-18
Upvotes: 0