Rnj

Reputation: 1189

Getting matched strings on new line with sed

I am trying to print each matched non-numerical string on its own line with sed.

So, if I have the string abc def 123 (ghi), I want the output to be:

(abc)
(def)
(ghi)

This is what I have tried:

echo "abc def 123   (ghi)" | sed -r 's/([a-z]+)/(\1)\n/g'

But this outputs the following:

(abc)
 (def)
 123   ((ghi)
)  

I am quite confused here and have several questions: Why is there a leading space on lines 2 and 3? Why is there a double bracket before ghi? Why is 123 not eliminated? Why did the closing bracket end up alone on the last line?

Update

Actually, I wanted to extract URLs from a specific domain. So, using the suggestions in the comments and answers, I tried the following:

in="https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy"
echo $in | sed 's/http[s]*:\/\/www.example.com\/[^ ]*/&\n/g'

This printed the following:

https://www.example.com/user1
 ddsf none http://www.example.com/user2
 kbu7f7yy

So, I tried this (as suggested in one )

echo $in | sed 's/.*\(http[s]*:\/\/www.example.com\/[^ ]*\).*/\1\n/g'

But I ended up getting:

http://www.example.com/user2

Upvotes: 0

Views: 69

Answers (3)

stevesliva

Reputation: 5665

The sed can be simple: sed 's/[()0-9]//g; s/[a-z]\+/(&)\n/g; s/ //g;'

  • Remove all parens and digits
  • Surround all words in (&)\n, where & is sed shorthand for the matched word
  • Remove all spaces
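As a quick check, the three steps above can be run against the sample string from the question. This is only a sketch: the variable names `input` and `result` are illustrative, and it assumes GNU sed, since `\+` and `\n` in the replacement are GNU extensions.

```shell
# Sample input from the question.
input='abc def 123   (ghi)'

# 1) drop parens and digits, 2) wrap each word as (word) plus a newline,
# 3) drop the leftover spaces. GNU sed is assumed for \+ and \n.
result=$(printf '%s\n' "$input" | sed 's/[()0-9]//g; s/[a-z]\+/(&)\n/g; s/ //g')

printf '%s\n' "$result"
```

Run directly, without the command substitution, the sed command also emits one empty line at the end, because the last word gets a newline appended just like the others.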

This could also be done this way: grep -Pow '[a-z]+' | sed 's/.*/(&)/'

For the URL example, grep is a lot easier for extracting words than sed: grep -Pow 'http\S+'

  • -P for perl matching to allow \S+ to mean 'non-space'
  • -o for only matching
  • -w for word matching (roughly equivalent to \bhttp\S+\b)
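A small sketch of the grep approach on the URL sample from the question's update. The second command is an assumed variant, not part of the answer above: it restricts matches to www.example.com and escapes the dots so they match literally. Variable names are illustrative, and GNU grep built with -P support is assumed.

```shell
# URL sample from the question's update.
input='https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy'

# The command from the answer: each match starts at http and runs to the next whitespace.
urls=$(printf '%s\n' "$input" | grep -Pow 'http\S+')

# Assumed variant: only URLs under www.example.com, with the dots escaped.
domain_urls=$(printf '%s\n' "$input" | grep -Po 'https?://www\.example\.com/\S*')

printf '%s\n' "$urls"
```

For this input both commands should print the same two URLs, one per line; they would differ only if the text contained http links to other hosts.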

If, for some reason, you still want to add parens: grep -Pow 'http\S+' | sed 's/.*/(&)/'

Upvotes: 0

potong

Reputation: 58578

This might work for you (GNU sed):

sed -E '/\n/!s/\<[[:alpha:]]+\>/\n(&)\n/g;/^\([[:alpha:]]+\)/P;D' file

This wraps each alphabetic string in parens and surrounds it with newlines, then prints only those lines that consist of an open paren, alphabetic characters and a closing paren.

For urls, maybe:

sed -E '/\n/!s/https?\S+/\n&\n/g;/^https?/P;D' file

Use the -E command line option so as to use extended regexps:

  • /\n/!s/https?\S+/\n&\n/g if the current line does not contain any newlines, globally replace each string that begins with http and an optional s by that same string surrounded by newlines.
  • /^https?/P if the front of the current pattern space begins with http and an optional s, print up to and including the next newline.
  • D delete up to and including the next newline and, if the pattern space is not empty, restart the sed cycle without fetching the next line from the file.

Thus, the first time through, the substitution takes place, and thereafter only the printing and deleting occur. The pattern space shrinks on each pass until it is empty, and then the next line is read into the pattern space.
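To make the cycle described above concrete, here is a sketch that feeds the question's samples through stdin instead of a file. The variable names are illustrative. GNU sed is assumed, since `\S`, `\<`, `\>` and `\n` in the replacement are GNU extensions.

```shell
input='https://www.example.com/user1 ddsf none  http://www.example.com/user2 kbu7f7yy'

# Substitution alone: every URL is moved onto its own line, leftovers remain.
marked=$(printf '%s\n' "$input" | sed -E 's/https?\S+/\n&\n/g')

# Full script: P prints only the pieces that start with http,
# D drops the first piece and restarts the cycle on what is left.
urls=$(printf '%s\n' "$input" | sed -E '/\n/!s/https?\S+/\n&\n/g;/^https?/P;D')

# The same P/D loop with the alphabetic-word pattern on the first sample.
words=$(printf '%s\n' 'abc def 123   (ghi)' |
  sed -E '/\n/!s/\<[[:alpha:]]+\>/\n(&)\n/g;/^\([[:alpha:]]+\)/P;D')

printf '%s\n' "$urls" "$words"
```

Printing `$marked` shows the intermediate state: the URLs sit on their own lines, mixed with the fragments that P later skips and D discards.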

Upvotes: 1

choroba

Reputation: 242443

Replace every run of non-letter characters, as well as the beginning and the end of the line, with ) (, then remove the surplus parentheses:

sed -r 's/[^a-z]+|^|$/) (/g;s/^\) | \($//g'

But I find the following Perl solution more readable:

perl -lne 'print "($1)" while /([a-z]+)/g'
  • -n reads the input line by line and runs the code for each line
  • -l removes newlines from input and adds them to output

Upvotes: 2
