Novice

Reputation: 633

Getting rid of all words that contain a special character in a text file

I'm trying to filter out every word that contains any character other than a letter from a text file. I've looked around Stack Overflow and other websites, but the answers I found were specific to other scenarios and I couldn't adapt them to mine; I've only recently started learning Unix tools.

Here's an example of what I want to do:

Input:

@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag

Output:

I was there and it was awesome!

So words with trailing punctuation, like awesome!, can stay in the file (in fact, I need them to stay), but any other token containing special characters (including tokens made up only of punctuation, like !!) needs to be removed. This can probably be done with sed, but I can't figure out the regex. Help.

Thanks!

Upvotes: 1

Views: 129

Answers (3)

Chris Koknat

Reputation: 3451

Here is how it could be done using Perl:

perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file

I am using this input text as my test case:

Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
@derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag

output:

Hello, 
How are you doing? 
I'd like 2.5 cups of piping-hot coffee. 
I was there; it was awesome! 

Command-line options:

  • -n loop around every line of the input file, do not automatically print it

  • -a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace

  • -e execute the perl code

With -a, each input line is split into the @F array. The Perl code then loops over every field $f and prints it, followed by a space, only if it matches the regular expression.
After the last field of each line, it prints a newline character.

The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is applied to each whitespace-delimited word:

  • ^ starts with

  • [a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)

  • [?!;:,.]? zero or one of the following punctuation: ?!;:,.

  • ( | ) alternation: match either the pattern before the | or the one after it

  • [\d.]+ one or more digits or periods, so numbers such as 2.5 are kept

  • $ end
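
To see how the pattern decides on individual words, here is a minimal shell sketch. It assumes a POSIX ERE translation of the Perl pattern (\d written as 0-9 and \x27 written as a literal single quote), and keep_word is a hypothetical helper name used only for illustration; it is not part of the Perl one-liner.

```shell
# ERE translation of the Perl pattern (assumption: \d -> 0-9, \x27 -> ').
pattern="^([a-zA-Z'-]+[?!;:,.]?|[0-9.]+)\$"

# keep_word (hypothetical helper): print "keep" if the whole word matches
# the pattern, otherwise print "drop".
keep_word() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo keep
  else
    echo drop
  fi
}

# Try a few words taken from the test input.
for word in 'awesome!' "I'd" 'piping-hot' '2.5' '@derik' '!!' '#hash_tag'; do
  printf '%s -> %s\n' "$word" "$(keep_word "$word")"
done
```

The first four words are reported as keep and the last three as drop, which lines up with the output shown earlier in this answer.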

Upvotes: 1

Ed Morton

Reputation: 203684

Your requirements aren't clear at all but this MAY be what you want:

$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!

Upvotes: 0

jazzabeanie

Reputation: 445

sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any word that starts with a special character and is preceded by whitespace, which gets you halfway there.

  • [[:space:]] is any whitespace character
  • [^a-zA-Z0-9[:space:]] is any special character, that is, anything other than a letter, digit, or whitespace
  • [^[:space:]]* is any number of non whitespace characters

Run it again with a ^ in place of the first [[:space:]] to remove those same words at the start of the line.
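
The two passes can probably be folded into one by using the alternation (^|[[:space:]]) in place of the first [[:space:]]. The following is a sketch under that assumption; strip_special_words is a hypothetical helper name, and the second expression, which trims the whitespace left behind at the start of the line, is an addition not found in the command above.

```shell
# strip_special_words (hypothetical helper): read lines on stdin and drop
# every word that starts with a special character.
strip_special_words() {
  sed -E \
    -e 's/(^|[[:space:]])[^a-zA-Z0-9[:space:]][^[:space:]]*//g' \
    -e 's/^[[:space:]]+//'
}

# Run it on the sample line from the question.
printf '%s\n' '@derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag' \
  | strip_special_words
```

On the sample line this prints "I was there and it was awesome! http://url.picture.whatever". The URL survives because it starts with a letter, so this is still only part of the way to the requested output.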

Upvotes: 0
