Reputation: 633
I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around Stack Overflow and other websites, but all the answers I found were very specific to a different scenario, and I wasn't able to adapt them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!
Upvotes: 1
Views: 129
Reputation: 3451
Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
@derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n
loop around every line of the input file, do not automatically print it
-a
autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e
execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, print a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$
is used on each whitespace-delimited word
^
starts with
[a-zA-Z-\x27]+
one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]?
zero or one of the following punctuation: ?!;:,.
(|)
alternatively, match
[\d.]+
one or more digits or periods
$
end of the word
Upvotes: 1
Reputation: 203684
Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!
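For readers new to awk, here is the same program reflowed over several lines with comments. It is a sketch of the one-liner above, not a different method; keep_alpha_words is a hypothetical wrapper name, and the POSIX classes [[:alpha:]] and [[:punct:]] assume an awk that supports them (some very old mawk builds may not).

```shell
# keep_alpha_words is a hypothetical wrapper around the awk one-liner.
keep_alpha_words() {
  awk '{
    rec = sep = ""                   # start every line with an empty result
    for (i = 1; i <= NF; i++)        # visit each whitespace-separated field
      if ($i ~ /^[[:alpha:]]+[[:punct:]]?$/) {  # letters, then at most one punctuation mark
        rec = rec sep $i             # append the accepted word
        sep = " "                    # words after the first get a leading space
      }
    print rec                        # print the rebuilt line, possibly empty
  }'
}

printf '%s\n' "I'd like 2.5 cups of piping-hot coffee." | keep_alpha_words
```

On this line the sketch prints like cups of coffee. because the pattern only allows punctuation as the final character, so I'd, 2.5 and piping-hot are dropped. That is stricter than the Perl answer, which keeps all three.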
Upvotes: 0
Reputation: 445
sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g'
will get rid of any words that start with punctuation and follow whitespace, which gets you halfway there.
[[:space:]]
is any whitespace character
[^a-zA-Z0-9[:space:]]
is any special character
[^[:space:]]*
is any number of non-whitespace characters
Do it again with a ^ instead of the first [[:space:]] to remove those same words at the start of the line.
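Both passes can be combined in one sed call, one -e expression for the start of the line and one for words that follow whitespace. This is a minimal sketch of that combination; drop_punct_led_words is a hypothetical wrapper name.

```shell
# drop_punct_led_words is a hypothetical wrapper around the two sed passes.
# First expression: remove a word starting with a special character at the
# start of the line. Second expression: remove every such word that follows
# a whitespace character, together with that whitespace.
drop_punct_led_words() {
  sed -E \
    -e 's/^[^a-zA-Z0-9[:space:]][^[:space:]]*//' \
    -e 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g'
}

printf '%s\n' '@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag' | drop_punct_led_words
```

On the question's sample line this leaves " I was there and it was awesome! http://url.picture.whatever" (with a leading space). The URL survives because it starts with a letter, so words with special characters in the middle still need separate handling.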
Upvotes: 0