JM88

Reputation: 477

Putting line break using sed in bash, problems with regular expressions

Hi everyone, my data looks like this:

  samplename 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 ...
  samplename2 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 ...

and I want it to look like this:

  >samplename
  0 1 1 1 1 1 1 1 1 1 
  1 0 0 0 0 0 0 0 0 ...
  >samplename2 
  0 0 0 0 0 1 1 1 1 1 
  1 1 1 1 1 1 0 0 0 ...

[note - showing a line break after every 10 digits; I actually want it after every 200, but I realize that showing a line like that would not be very helpful].

I could do it using a regular expression in a text editor, but I want to use the sed command in bash because I have to do this several times and I need 200 characters per row.

I tried this but got an error:

sed -e "s/\(>\w+\)\s\([0-9]+\)/\1\n\2" < myfile > myfile2

sed: 1: "s/(>\w+)\s([0-9]+)/ ...": unescaped newline inside substitute pattern

One more note: I am doing this on a Mac, and I know that sed on the Mac is a little bit different from GNU sed. If you are able to give me a solution that works on a Mac, that would be great.

Thanks in advance.

Upvotes: 3

Views: 534

Answers (5)

perreal

Reputation: 98088

fold is your friend:

sed 's/\([^ ]*\) /\1\n/' input | fold -w 100
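
A hedged variant of the same idea for macOS: BSD sed does not turn `\n` in the replacement into a newline, so this sketch uses a backslash followed by a literal newline, which both BSD and GNU sed accept. The function name `split_and_fold` and the default width of 200 (the width the question asks for) are illustrative assumptions, and the input is assumed to have no leading whitespace.

```shell
# Sketch only: split_and_fold is a hypothetical helper name.
# $1 = input file, $2 = wrap width in characters (defaults to 200).
split_and_fold() {
    width="${2:-200}"
    # Put ">" before the first word and end the line after it.
    # The backslash + literal newline is the portable way to insert a newline.
    sed 's/^\([^ ]*\) />\1\
/' "$1" | fold -w "$width"
}
```

Usage would be `split_and_fold myfile > myfile2`. Note that fold counts characters, spaces included, so 200 characters corresponds to 100 single-digit values.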

Upvotes: 1

Ed Morton

Reputation: 204258

$ awk '{print ">" $1; for (i=2;i<=NF;i++) printf "%s%s", $i, ((i-1)%10 ? FS : RS)}' file
>samplename
0 1 1 1 1 1 1 1 1 1
1 0 0 0 0 0 0 0 0 ...
>samplename2
0 0 0 0 0 1 1 1 1 1
1 1 1 1 1 1 0 0 0 ...

Upvotes: 1

Floris

Reputation: 46435

With your added request for a line break after 200 numbers, you are much better off using awk.

echo "hello 1 2 3 4" | awk '{print ">"$1; for(i=2; i<=NF; i++) {printf("%d ",$i); if((i+1)%2 == 0) printf("\n");}}'

prints out

>hello
1 2 
3 4 

If you want this to work only on lines that start with hello, you can modify as

echo "hello 1 2 3 4" | awk '/^hello / {print ">"$1; for(i=2; i<=NF; i++) {printf("%d ",$i); if((i+1)%2 == 0) printf("\n");}}'

(The regular expression between the slashes says "only do this on lines that match this expression".)

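You can modify the statement if( (i + 1) % 2 == 0) to be if( (i - 1) % 100 == 0 ) to get a newline after every 100 numbers. Note the i - 1 rather than i + 1: the first number is field 2, so i - 1 is the count of numbers printed so far; the two forms agree modulo 2 but not modulo 100. I just showed it for 2 because the printout is more readable.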

update to make this all much cleaner, do the following.

Create a file called breakIt with the following contents (leave out the /^hello/ if you don't want to select only lines starting with "hello", but keep the {} around the code, it matters):

/^hello/ { print ">"$1;
   for(i=2; i<=NF; i++)
   {
      printf("%d ",$i);
      if((i-1)%100 == 0) printf("\n");   # i-1 numbers printed so far
   }
   print "";
}

Now you can issue the command

awk -f breakIt inputFile > outputFile

This says "use the contents of breakIt as the commands to process inputFile and put the results in outputFile".

Should do the trick nicely for you.
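
As a sketch of how the row length could be made a parameter instead of being hard-coded, the awk program can take the count through `-v`. The names `wrap_fields` and `per_line` are made up for illustration, and this version prints the values as strings (`%s`) and ends the last row of each record with a newline rather than a trailing space.

```shell
# Sketch only: wrap_fields and per_line are hypothetical names.
# $1 = input file, $2 = number of values per output row (defaults to 100).
wrap_fields() {
    awk -v per_line="${2:-100}" '{
        print ">" $1
        for (i = 2; i <= NF; i++) {
            # i - 1 is the number of values printed so far for this record;
            # break the row when it is full or when the record ends.
            sep = ((i - 1) % per_line == 0 || i == NF) ? "\n" : " "
            printf "%s%s", $i, sep
        }
    }' "$1"
}
```

It would be called as `wrap_fields inputFile 100 > outputFile`.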

edit just in case you really do want a sed solution, here is a nice one (well I think so). Copy the following into a file called sedSplit

s/^([A-Za-z]+ )/>\1\
/g
s/([0-9 ]{10})/\1\
/g
s/$/\
/g

This has three consecutive sed commands; these are each on their own line, but since they insert newlines, they actually appear to take six lines.

s/^                  - substitute, starting from the beginning of the line
([A-Za-z]+ )/        - substitute the first word (letters only) plus space, replacing with 
>\1\
/g                   - the literal '>', then the first match, then a newline, as often as needed (g)

s/([0-9 ]{10})/      - substitute a run of 10 characters, each a digit or a space
\1\
/g                   - replace with itself, followed by newline, as often as needed

s/$/\
/g                   - replace the 'end of line' with a newline

You invoke this sed script like this:

sed -E -f sedSplit < inputFile > outputFile

This uses the

-E flag (use extended regular expressions - no need for escaping brackets and such)

-f flag ('get instructions from this file')

It makes the whole thing much cleaner, and gives you the output you asked for on a Mac (even with an extra newline to separate the groups; if you don't want that, leave out the last two lines).

Upvotes: 1

Martin

Reputation: 923

In double quotes the backslash is interpreted by the shell. Either one of these should work.

sed -e 's/\(>\w+\)\s\([0-9]+\)/\1\n\2/' < myfile > myfile2
sed -e "s/\\(>\\w+\\)\\s\\([0-9]+\\)/\\1\\n\\2/" < myfile > myfile2

PS, I added the terminating slash. You had a s/.../... instead of s/.../.../

PS, as I'm looking at your regexp, sed will complain no end. Try this.

sed -e 's/^\(\w\+\)\s\+/>\1\n/' < myfile > myfile2

Mac version, with 200 character limit (100 single digits and 100 spaces):

sed -Ee 's/^([a-zA-Z0-9]+) />\1\
/' < myfile | sed -Ee 's/(([0-9] ){99}[0-9]) /\1\
/g' > myfile2

First sed separates the character string from the number, the second splits the lines.

Upvotes: 0

glenn jackman

Reputation: 247052

plain bash:

while read -r name values; do
    printf ">%s\n%s\n" "$name" "$values"
done <<END
samplename 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 ...
samplename2 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 ...
END
>samplename
0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 ...
>samplename2
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 ...

assuming the samplename does not contain whitespace
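
The loop above splits off the name but leaves the values on one line. A minimal sketch of one way to add the wrapping is to pipe only the values through fold, so the name line is never folded. The function name `name_then_fold` is hypothetical, and the width of 200 comes from the question.

```shell
# Sketch only: name_then_fold is a hypothetical helper name.
# $1 = input file, $2 = wrap width in characters (defaults to 200).
name_then_fold() {
    width="${2:-200}"
    while read -r name values; do
        printf '>%s\n' "$name"
        # Only the values are folded; the ">name" line is left untouched.
        printf '%s\n' "$values" | fold -w "$width"
    done < "$1"
}
```

For example, `name_then_fold myfile > myfile2`.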

Upvotes: 0
