Reputation: 5738
I have a large file (~20GB) that I need to parse. I need to merge every line that doesn't start with an integer into the preceding line. The file looks like this:
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem
dolores eos qui ratione
quia non numquam eius
nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
In the end I need it to look like this:
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
I tried a lot of things, for example:
With a registry ... this works great on smaller files, but on large files it runs out of memory:
sed ':a;N;$!ba;s/\n[\t ]*\([a-zA-Z]\+\)/ \1/g'
Using the hold buffer (this prints just the lines that start with an integer):
sed -n '
/^[ \t]\+[0-9]\+/ {
p
h
}
/^[ \t]\+[0-9]\+/ !{
H
}
'
or:
sed -n '
/^[ \t]\+[0-9]\+/ b jumpTO
H
$ b jumpTO
b
:jumpTO
x
p
'
The code for stripping the spaces and tabs from lines without integers is missing; that part is not important and is trivial to implement.
Please take a look at the code and point out what I am doing wrong. Thank you.
Upvotes: 0
Views: 783
Reputation: 58371
This might work for you (GNU sed):
sed ':a;$!N;s/\n\s*\([^0-9 ]\)/ \1/;ta;P;D' file
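For readers who want to see what each command in the one-liner does, here is a commented sketch of the same idea. The function name `merge_lines` is hypothetical, and `[[:space:]]` is used in place of the GNU-only `\s`; the comment syntax inside the script assumes GNU sed.

```shell
# merge_lines is a hypothetical wrapper name, not part of the answer above.
merge_lines() {
  sed '
    :a
    # Append the next input line to the pattern space (skipped on the last line).
    $!N
    # If the appended line starts, after optional blanks, with a non-digit,
    # replace the newline and the blanks with a single space.
    s/\n[[:space:]]*\([^0-9[:space:]]\)/ \1/
    # A successful join means more continuation lines may follow: loop.
    ta
    # No join: print the finished record (text up to the newline) ...
    P
    # ... then drop it and restart the script with the remaining line.
    D
  ' "$@"
}
```

Usage: `merge_lines file`. The pattern space only ever holds one merged record plus one lookahead line, so memory use should not grow with the size of the file.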
Upvotes: 1
Reputation: 2564
Nothing is impossible with sed =)
sed -n '/^[0-9]/{x;1!p};/^[^0-9]/{H;x;s/\n\s*\([^0-9]\)/ \1/;x};${x;p}'
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
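The hold-space approach can also be written out step by step. This is only a sketch under a few assumptions: the function name `merge_with_hold` is hypothetical, a "numbered" line is taken to be one whose first non-blank character is a digit, and GNU sed is assumed for the comments inside the script.

```shell
# merge_with_hold is a hypothetical name used for illustration only.
merge_with_hold() {
  sed -n '
    # Continuation line: append it to the record buffered in the hold space.
    /^[[:space:]]*[0-9]/!{
      H
      # Not the last line: nothing to print yet, end this cycle.
      $!d
      # Last line: fetch the buffered record, join its lines, print it.
      x
      s/\n[[:space:]]*/ /g
      p
      d
    }
    # Numbered line: swap, so the previous record is in the pattern space
    # and the new line is buffered in the hold space.
    x
    # On line 1 there is no previous record to flush.
    1!{
      s/\n[[:space:]]*/ /g
      p
    }
    $!d
    # Last line of input and it is a numbered line: print it as well.
    x
    p
  ' "$@"
}
```

Only one record is buffered at a time, so this variant should also cope with very large inputs.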
Upvotes: 1
Reputation: 54392
One way using awk:
awk '/^[ \t]*[0-9]/ { if (line) print line; line = $0; next } { sub(/^[ \t]*/, " "); line = line $0 } END { print line }' file.txt
Results:
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
Upvotes: 1
Reputation: 67211
Maybe this will work for you:
nawk '{if($1~"^[a-zA-Z]"){p=p" "$0}else{if(NR>1)print p;p=$0}}END{if(NR)print p}' temp
> cat temp
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem
dolores eos qui ratione
quia non numquam eius
nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
> nawk '{if($1~"^[a-zA-Z]"){p=p" "$0}else{if(NR>1)print p;p=$0}}END{if(NR)print p}' temp
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
Upvotes: 1
Reputation: 22482
sed is generally not very good at combining lines together, as it is designed to work on a per-line basis. A solution using awk might be better:
input.txt:
$ cat input.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem
dolores eos qui ratione
quia non numquam eius
nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
script.awk:
/^[0-9]+/ {
    # A numbered line starts a new record: end the previous output line first.
    if (NR==1) {
        printf "%s", $0
    } else {
        printf "\n%s", $0
    }
}
/^[^0-9]+/ {
    # A continuation line: trim surrounding whitespace, then append it after one space.
    gsub(/^[ \t]+/,"",$0);
    gsub(/[ \t]+$/,"",$0);
    printf " %s", $0
}
END {
    printf "\n"
}
output:
$ awk -f script.awk input.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
1820 zt enim ad minim veniam
Update: improved code to remove whitespace
Upvotes: 1
Reputation: 195039
Does this one-liner help? (awk)
awk '$1~/^[0-9]/{printf "%s%s",(NR>1?"\n":""),$0;next}{printf " %s",$0}END{print ""}' yourfile
Test:
kent$ cat test.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem
dolores eos qui ratione
quia non numquam eius
nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
kent$ awk '$1~/^[0-9]/{printf "%s%s",(NR>1?"\n":""),$0;next}{printf " %s",$0}END{print ""}' test.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909 Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
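One caveat with printf-based one-liners in awk: if an input line is passed as the format string itself, any % character in the data is interpreted as a conversion and can abort the run or mangle the output. A small sketch that always prints through "%s" is shown here; the function name `join_records` is hypothetical, and a "numbered" line is assumed to be one whose first non-blank character is a digit.

```shell
# join_records is a hypothetical name used for illustration only.
join_records() {
  awk '
    # Numbered line: close the previous record (if any) and start a new one.
    /^[ \t]*[0-9]/ { if (NR > 1) printf "\n"; printf "%s", $0; next }
    # Continuation line: collapse its leading blanks into a single space and append.
    { sub(/^[ \t]*/, " "); printf "%s", $0 }
    # Terminate the last record, but print nothing for empty input.
    END { if (NR > 0) printf "\n" }
  ' "$@"
}
```

Usage: `join_records test.txt`. Because awk handles one line at a time and nothing is buffered, memory use stays small regardless of file size.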
Upvotes: 1