Reputation: 141
I have text data in this form:
^Well/Well[ADV]+ADV ^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]
^quite/quite[ADV]+ADV ^different/different[ADJ]+ADJ ^not/not[PART]
^necessarily/necessarily[ADV]+ADV ^more/more[ADV]+ADV
^elaborated/elaborate[V]+V+PPART ^theology/theology[N]+N *edu$
And I want it to be processed to this form:
Well John have a quite different not necessarily more elaborate theology
Basically, I need every string between the starting character / and the ending character [.
Here is what I tried, but I just get empty files...
#!/bin/bash
for file in probe/*.txt
do sed '///,/[/d' $file > $file.aa
mv $file.aa $file
done
Upvotes: 2
Views: 314
Reputation: 52556
With GNU grep and Perl compatible regular expressions (-P):
$ echo $(grep -Po '(?<=/)[^[]*' infile)
Well John have a quite different not necessarily more elaborate theology
-o retains just the matches, (?<=/) is a positive look-behind ("make sure there is a /, but don't include it in the match"), and [^[]* is "a sequence of characters other than [".
grep -Po prints one match per line; by using the output of grep as arguments to echo, we convert the newlines into spaces (could also be done by piping to tr '\n' ' ').
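Since the question loops over several files, here is a minimal sketch of applying the same extraction to each file in place. The helper name extract_lemmas is made up for illustration, the temporary directory only stands in for probe/, and GNU grep is assumed for -P. paste joins the matches into one line, as an alternative to the echo trick.

```shell
#!/bin/bash
# Sketch only: extract_lemmas is a hypothetical helper; requires GNU grep (-P).
extract_lemmas() {
    # grep prints one match per line; paste joins them with single spaces.
    grep -Po '(?<=/)[^[]*' "$1" | paste -sd' ' -
}

# Throwaway directory standing in for probe/ (assumption for the demo).
dir=$(mktemp -d)
printf '%s\n' '^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]' \
              '^theology/theology[N]+N *edu$' > "$dir/sample.txt"

for file in "$dir"/*.txt; do
    # Write to a temporary file first so the original is never
    # truncated while it is still being read.
    extract_lemmas "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done

cat "$dir/sample.txt"
```

Quoting "$file" keeps the loop working for file names that contain spaces.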
Upvotes: 2
Reputation: 67567
awk to the rescue!
$ awk -F/ -v RS=^ -v ORS=' ' '{print $1}' file
Well John has a quite different not necessarily more elaborated theology
Explanation: set the record separator (RS) to ^ to separate your logical groups, set the field separator (FS) to /, and print the first field. Note that the first field is the word before the /, which is why the output has "has" and "elaborated" rather than the "have" and "elaborate" that sit between / and [. Finally, setting the output record separator (ORS) to a space (instead of the default newline) keeps the extracted fields on the same line.
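If the text between / and [ is what is wanted, one possible variation is to split on either character, so that this text becomes the second field. This is a sketch, not part of the original command: the helper name print_lemmas and the inlined sample input are for illustration only.

```shell
#!/bin/bash
# Sketch only: print_lemmas is a hypothetical helper name.
print_lemmas() {
    # FS is a bracket expression matching "/" or "[", so for a record such as
    # "has/have[V]+V+3sg+PRES " the fields are "has", "have", "V]+V+3sg+PRES ".
    # NF > 1 skips the empty record before the first "^".
    # printf in END adds the final newline, since ORS is a space.
    awk -F'[/[]' -v RS='^' -v ORS=' ' \
        'NF > 1 { print $2 } END { printf "\n" }' "$@"
}

# Inlined sample input (assumption for the demo).
printf '%s\n' '^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]' \
              '^elaborated/elaborate[V]+V+PPART ^theology/theology[N]+N *edu$' |
print_lemmas
```

The output line ends with a trailing space, because ORS is written after every printed field, including the last one.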
Upvotes: 4
Reputation: 3141
grep -oE '/[^[]*\[' file | sed -e 's#^/##' -e 's/\[$//' | tr -s '\n' ' '
Upvotes: -1