Radu M.

Reputation: 5738

Merge lines between pattern

I have a large file (~20GB) that I need to parse. I need to merge every line that doesn't start with an integer into the preceding line that does. The file looks like this:

     1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
     1906 ut perspiciatis unde omnis iste natus error sit 
     1909  Nemo enim ipsam voluptatem
            dolores eos qui ratione
       quia non numquam eius
         nisi ut aliquid ex ea com
     1820 zt enim ad minim veniam

In the end I need it to look like this:

     1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
     1906 ut perspiciatis unde omnis iste natus error sit 
     1909  Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
     1820 zt enim ad minim veniam

I tried a lot of things, for example:

With a regex ... this works great on smaller files, but on large files it runs out of memory:

sed ':a;N;$!ba;s/\n[\t ]*\([a-zA-Z]\+\)/ \1/g'

Using the hold buffer (this prints only the lines that start with an integer):

sed -n '
/^[ \t]\+[0-9]\+/ {
    p
    h
}
/^[ \t]\+[0-9]\+/ !{
    H
}
'

or:

sed -n '
/^[ \t]\+[0-9]\+/ b jumpTO
        H
        $ b jumpTO
        b
:jumpTO
x
p
'

The code for stripping the spaces and tabs from the lines without integers is missing; that part is not important and is trivial to implement.

Please take a look at the code and point out what I am doing wrong. Thank you.

Upvotes: 0

Views: 783

Answers (6)

potong

Reputation: 58371

This might work for you (GNU sed):

sed ':a;$!N;s/\n\s*\([^0-9 ]\)/ \1/;ta;P;D' file
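
For readers who want to see what each command in that one-liner does, here is the same script spread over several lines with comments. It is only a sketch: the wrapper name `merge_sed` is made up for illustration, and `\s` is a GNU sed extension, so this assumes GNU sed. The pattern space never holds more than the current record plus one look-ahead line, so memory use should not grow with file size.

```shell
# Hypothetical wrapper around the one-liner above (GNU sed assumed).
merge_sed() {
  sed '
    :a
    # Append the next input line, unless this is already the last line.
    $!N
    # If the appended line begins (after blanks) with a non-digit, join it.
    s/\n\s*\([^0-9 ]\)/ \1/
    # A join happened: loop back and try to pull in one more line.
    ta
    # Otherwise print the finished record (text up to the first newline)...
    P
    # ...and delete it, restarting the script on the look-ahead line.
    D
  ' "$@"
}
```

Usage: `merge_sed file`, or `some_command | merge_sed`.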

Upvotes: 1

rush

Reputation: 2564

Nothing is impossible with sed =)

sed -n '/^[0-9]/{x;p};/^[^0-9]/{H;x;s/\n\s*\([^0-9]\)/ \1/;x};${x;p}'

1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit
1909  Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam
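
The hold-space idea can also be written out step by step. The following is a sketch rather than a drop-in replacement: the function name `merge_hold` is hypothetical, and, as an assumption based on the sample in the question, numbered lines are allowed to have leading blanks. Only one record lives in the hold space at a time, so it should cope with very large files.

```shell
# Hypothetical helper: merge continuation lines using the hold space.
merge_hold() {
  sed -n '
    # A line whose first non-blank character is a digit starts a new record.
    /^[[:space:]]*[0-9]/ {
      # Swap: pattern space gets the finished record, hold space gets this line.
      x
      # Print the finished record (there is none yet on line 1).
      1!p
      # On the last line, also print the record that was just stored.
      $ {
        x
        p
      }
      d
    }
    # Continuation line: drop its indentation and append it to the hold space.
    s/^[[:space:]]*//
    H
    # Replace the newline that H inserted with a single space.
    x
    s/\n/ /
    x
    # On the last line, print the accumulated record.
    $ {
      x
      p
    }
  ' "$@"
}
```

Compared with the attempts in the question, the key difference is that a numbered line is not printed as soon as it is read; it is stored, and only printed once the next numbered line (or the end of the file) shows that the record is complete.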

Upvotes: 1

Steve

Reputation: 54392

One way using awk:

awk '/^[ \t]*[0-9]/ { if (line != "") print line; line = $0; next } { sub(/^[ \t]+/, " "); line = line $0 } END { print line }' file.txt

Results:

     1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
     1906 ut perspiciatis unde omnis iste natus error sit 
     1909  Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
     1820 zt enim ad minim veniam
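
The same approach, laid out as a commented script. This is a sketch under a few assumptions: the name `merge_awk` and the `have` flag are made up for illustration, a record is taken to start on any line whose first non-blank character is a digit, and a continuation line always gets exactly one separating space even if it was not indented. awk keeps only the current record in memory, so file size should not matter.

```shell
# Hypothetical helper: buffer one record in awk and flush it when the next begins.
merge_awk() {
  awk '
    # A record starts on a line whose first non-blank character is a digit.
    /^[ \t]*[0-9]/ {
      if (have) print line      # flush the previous record, if any
      line = $0
      have = 1
      next
    }
    # Continuation line: replace its indentation (possibly empty) with one space.
    {
      sub(/^[ \t]*/, " ")
      line = line $0
      have = 1
    }
    # Flush the final record.
    END { if (have) print line }
  ' "$@"
}
```

Usage: `merge_awk file.txt > merged.txt`.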

Upvotes: 1

Vijay

Reputation: 67211

Maybe this will work for you:

nawk '{if($1~"^[a-zA-Z]"){p=p" "$0}else{if(NR>1)print p;p=$0}}END{print p}' temp

> cat temp
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem
dolores eos qui ratione
quia non numquam eius
nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

> nawk '{if($1~"^[a-zA-Z]"){p=p" "$0}else{if(NR>1)print p;p=$0}}END{print p}' temp
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

Upvotes: 1

Lee Netherton

Reputation: 22482

sed is generally not very good at combining lines together, as it is designed to work on a per-line basis.

A solution using awk might be better:

input.txt:

$ cat input.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem
       dolores eos qui ratione
  quia non numquam eius
    nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

script.awk:

/^[0-9]+/ {
    if (NR==1) {
        printf "%s", $0
    } else {
        printf "\n%s", $0
    }
}

/^[^0-9]/ {
    # Strip leading and trailing blanks, then join with a single space.
    sub(/^[ \t]+/, "", $0);
    sub(/[ \t]+$/, "", $0);
    printf " %s", $0
}

END {
    printf "\n"
}

output:

$ awk -f script.awk input.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem dolores eos qui ratione quia non numquam eius nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

Update: improved code to remove whitespace

Upvotes: 1

Kent

Reputation: 195039

Does this awk one-liner help?

 awk '$1~/^[0-9]+/{printf "\n%s", $0;next}{printf "%s", $0}' yourfile

Test:

kent$  cat test.txt
1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem
       dolores eos qui ratione
  quia non numquam eius
    nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

kent$  awk '$1~/^[0-9]+/{printf "\n%s", $0;next}{printf "%s", $0}' test.txt

1647 Lorem ipsum dolor sit amet, consectetur adipisicing elit
1906 ut perspiciatis unde omnis iste natus error sit 
1909  Nemo enim ipsam voluptatem       dolores eos qui ratione  quia non numquam eius    nisi ut aliquid ex ea com
1820 zt enim ad minim veniam

Upvotes: 1
