Reputation: 464

Parsing simple string with awk or sed in linux

original string :
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/

Depth of directories will vary, but /trunk part will always remain the same. And a single character in front of /trunk is the indicator of that line.

desired output :

A /trunk/apple
B /trunk/apple
Z /trunk/orange
Q /trunk/melon/juice/venti/straw

*** edit
I'm sorry I made a mistake by adding a slash at the end of each path in the original string which made the output confusing. Original string didn't have the slash in front of the capital letter, but I'll leave it be.

my attempt :

echo $str1 | sed 's/$.\/trunk$/\n\1/g'

I feel like it should work but it doesn't.

Upvotes: 4

Answers (9)

Carlos Pascual

Reputation: 1126

Update

With your new data file:

$ cat file
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/

This GNU awk solution:

awk '
{
sub(/[/]$/,"")
gsub(/[[:upper:]]{1}/,"& ")
print gensub(/([/])([[:upper:]])/,"\n\\2","g")
}' file
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Upvotes: 1

The fourth bird

Reputation: 163632

Using gnu awk you could use FPAT to set contents of each field using a pattern.

When looping the fields, replace the first / with /

str1="A/trunk/apple/B/trunk/apple/Z/trunk/orange"

echo $str1 | awk -v FPAT='[^/]+/trunk/[^/]+' '{    
for(i=1;i<=NF;i++) {
    sub("/", " /", $i)
    print $i
    }
}'

The pattern matches

[^/]+ Match any char except /
/trunk/[^/]+ Match /trunk/ and any char except /

Output

A  /trunk/apple
B  /trunk/apple
Z  /trunk/orange

Other patterns that can be used by FPAT after the updated question:

Matching a word boundary \\< and an uppercase char A-Z and after /trunk repeat / and lowercase chars

FPAT='\\<[A-Z]/trunk(/[a-z]+)*'

If the length of the strings for the directories after /trunk are at least 2 characters:

FPAT='\\<[A-Z]/trunk(/[^/]{2,})*'

If there can be no separate folders that consist of a single uppercase char A-Z

FPAT='\\<[A-Z]/trunk(/([^/A-Z][^/]*|[^/]{2,}))*'

Output

A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Upvotes: 2

Renaud Pacalet

Reputation: 29345

With GNU sed:

$ str="A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/"
$ sed -E 's|/?(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"

A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Note the first empty output line. If it is undesirable we can separate the processing of the first output line:

$ sed -E 's|(.)|\1 |;s|/(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Upvotes: 2

stevesliva

Reputation: 5665

Some fun with perl, where you can using nonconsuming regex to autosplit into the @F array, then just print however you want.

perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'

Step #1: Split

perl -lanF/(?=.{1,2}trunk)/'
This will take the input stream, and split each line whenever the pattern .{1,2}trunk is encountered
Because we want to retain trunk and the preceeding 1 or 2 chars, we wrap the split pattern in the (?=) for a non-consuming forward lookahead
This splits things up this way:

$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print join " ", @F'
A /trunk/apple/ B /trunk/apple/ Z /trunk/orange/citrus/ Q /trunk/melon/juice/venti/straw/

Step 2: Format output:

The @F array contains pairs that we want to print in order, so we'll iterate half of the array indices, and print 2 at a time:
print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2 --> Double the iterator, and print pairs
using perl -l means each print has an implicit \n at the end
The results:

$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
A /trunk/apple/
B /trunk/apple/
Z /trunk/orange/citrus/
Q /trunk/melon/juice/venti/straw/

Endnote: Perl obfuscation that didn't work.

Any array in perl can be cast as a hash, of the format (key,val,key,val....)
So %F=@F; print "$_ $F{$_}" for keys %F seems like it would be really slick
But you lose order:

$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e '%F=@F; print "$_ $F{$_}" for keys %F'
Z /trunk/orange/citrus/
A /trunk/apple/
Q /trunk/melon/juice/venti/straw/
B /trunk/apple/

Upvotes: 1

sseLtaH

Reputation: 11247

Assuming your data will always be in the format provided as a single string, you can try this sed.

$ sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g' input_file

$ echo "A/trunk/apple/pine/skunk/B/trunk/runk/bunk/apple/Z/trunk/orange/T/fruits/apple/mango/P/anything/apple/pear/banana/L/ball/orange/anything/S/fruits/apple/mango/B/rupert/cream/travel/scout/H/tall/mountains/pottery/barnes" | sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g'
A /trunk/apple/pine/skunk
B /trunk/runk/bunk/apple
Z /trunk/orange
T /fruits/apple/mango
P /anything/apple/pear/banana
L /ball/orange/anything
S /fruits/apple/mango
B /rupert/cream/travel/scout
H /tall/mountains/pottery/barnes

Upvotes: 1

RavinderSingh13

Reputation: 133760

To deal with complex samples input, like where there could be N number of / and values after trunk in a single line please try following.

awk '
{
  gsub(/[^/]*\/trunk/,OFS"&")
  sub(/^ /,"")
  sub(/\//,OFS"&")
  gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")
  sub(/\n/,OFS)
  gsub(/\n /,ORS)
  gsub(/\/trunk/,OFS"&")
  sub(/[[:space:]]+/,OFS)
}
1
'  Input_file

Explanation: Adding detailed explanation for above.

awk '                                            ##Starting awk program from here.
{
  gsub(/[^/]*\/trunk/,OFS"&")                    ##Globally substituting everything from / to till next / followed by trunk/ with space and matched value.
  sub(/^ /,"")                                   ##Substituting starting space with NULL here.
  sub(/\//,OFS"&")                               ##Substituting first / with space / here.
  gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")    ##Globally substituting spaces followed by everything till / trunk till space comes with new line and matched values.
  sub(/\n/,OFS)                                  ##Substituting new line with space.
  gsub(/\n /,ORS)                                ##Globally substituting new line space with ORS.
  gsub(/\/trunk/,OFS"&")                         ##Globally substituting /trunk with OFS and matched value.
  sub(/[[:space:]]+/,OFS)                        ##Substituting spaces with OFS here.
}
1                                                ##Printing edited/non-edited line here.
'  Input_file                                    ##Mentioning Input_file name here.

With your shown samples, please try following awk code.

awk '{gsub(/\/trunk/,OFS "&");gsub(/trunk\/[^/]*\//,"&\n")} 1' Input_file

Upvotes: 3

Ed Morton

Reputation: 204558

With GNU awk for multi-char RS and RT:

$ awk -v RS='([^/]+/){2}[^/\n]+' 'RT{sub("/",OFS,RT); print RT}' file
A trunk/apple
B trunk/apple
Z trunk/orange

I'm setting RS to a regexp describing each string you want to match, i.e. 2 repetitions of non-/s followed by / and then a final string of non-/s (and non-newline for the last string on the input line). RT is automatically set to each of the matching strings, so then I just change the first / to a blank and print the result.

If each path isn't always 3 levels deep but does always start with something/trunk/, e.g.:

$ cat file
A/trunk/apple/banana/B/trunk/apple/Z/trunk/orange

then:

$ awk -v RS='[^/]+/trunk/' 'RT{if (NR>1) print pfx $0; pfx=gensub("/"," ",1,RT)} END{printf "%s%s", pfx, $0}' file
A trunk/apple/banana/
B trunk/apple/
Z trunk/orange

Upvotes: 3

potong

Reputation: 58568

This might work for you (GNU sed):

sed 's/[^/]*/& /;s/\//\n/3;P;D' file

Separate the first word from the first / by a space.

Replace the third / by a newline.

Print/delete the first line and repeat.

If the first word has the property that it is only one character long:

sed 's/./& /;s#/\(./\)#\n\1#;P;D' file

Or if the first word has the property that it begins with an upper case character:

sed 's/[[:upper:]][^/]*/& /;s#/\([[:upper:][^/]*/\)#\n\1#;P;D' file

Or if the first word has the property that it is followed by /trunk/:

sed -E 's#([^/]*)(/trunk/)#\n\1 \2#g;s/.//' file

Upvotes: 2

Andre Wildberg

Reputation: 19211

In awk you can try this solution. It deals with the special requirement of removing forward slashes when the next character is upper case. Will not win a design award but works.

$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" | 
    awk -F '' '{ x=""; for(i=1;i<=NF;i++){ 
      if($(i+1)~/[A-Z]/&&$i=="/"){$i=""}; 
      if($i~/[A-Z]/){ printf x""$i" "}
      else{ x="\n"; printf $i } }; print "" }'
A /trunk/apple
B /trunk/apple
Z /trunk/orange

Also works for n words. Actually works with anything that follows the given pattern.

$ echo "A/fruits/apple/mango/B/anything/apple/pear/banana/Z/ball/orange/anything" | 
    awk -F '' '{ x=""; for(i=1;i<=NF;i++){
      if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
      if($i~/[A-Z]/){ printf x""$i" "}
      else{ x="\n"; printf $i } }; print "" }'
A /fruits/apple/mango
B /anything/apple/pear/banana
Z /ball/orange/anything

Upvotes: 2

Parsing simple string with awk or sed in linux

Answers (9)

Related Questions