Santhosh Ram

Reputation: 93

Remove unknown special characters from file

I want to remove all the special characters except |, _, - and . from a pipe-separated file.

For example, my data file looks like this:

ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New*name
QRST|124|New/name
UVWX|353|Name_*%NAME
EFGH|456|New=Name
Eaba|456|New****Name
fdsf|456|New-----Name
iouk|456|New(#$%^)_Name

I have tried the commands below, but I only got about halfway there.

tr -cd '[:print:]' < temp.txt > newfile -- I still get all the special chars.
tr -cd '[:alnum:]' <temp.txt -- I get only alphanumeric chars, but I want to keep a few special chars.
cat temp.txt | sed 's/[a-zA-Z0-9|_-.]//g' | sed '/^$/d' -- I get all the special chars, but with repetitions.

The command below gives me this output:

$ cat temp.txt | sed 's/[a-zA-Z0-9|_-.]//g' | sed '/^$/d' | tr -cd '[:print:]' | sort -u
""""){***+#=**~>>\+*****<(")

If I at least get all the unique special characters, I'll be able to put everything into a sed and replace with null.

My expected output is:

ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name

I only need to look at a specific column, if that helps reduce the code. As said earlier, the code has to keep the |, _ and - characters and remove everything else. Let me know if you need any more info.

Upvotes: 2

Views: 3204

Answers (6)

jwan

Reputation: 133

This should do the trick:

sed -r -e 's#[^a-zA-Z0-9|_]+#_#g' -e 's/_+/_/g'

However, there are some inconsistencies between your expected output and your stated goals.

In particular, you say you want to keep hyphens, and you do keep one on the EFGH line, but you replaced them on the fdsf line.
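To make the two readings of the hyphen requirement concrete, here is a rough sketch of both. The function names keep_hyphens and collapse_hyphen_runs are made up for illustration, and the second reading (single hyphens stay, runs of two or more become one underscore) is only a guess at what the expected output implies.

```shell
# Reading 1 (assumption): every hyphen is kept exactly as typed.
# The '-' sits last in the bracket expression so it is literal, not a range.
keep_hyphens() {
    sed -E -e 's#[^a-zA-Z0-9|_-]+#_#g' -e 's/_+/_/g'
}

# Reading 2 (assumption): a single hyphen is kept, but a run of two or more
# hyphens is replaced by one underscore, as on the fdsf line.
collapse_hyphen_runs() {
    sed -E -e 's#--+#_#g' -e 's#[^a-zA-Z0-9|_-]+#_#g' -e 's/_+/_/g'
}

printf '%s\n' 'EFGH|456|New-Name' 'fdsf|456|New-----Name' | keep_hyphens
printf '%s\n' 'EFGH|456|New-Name' 'fdsf|456|New-----Name' | collapse_hyphen_runs
```

Under the first reading the fdsf line comes back unchanged; under the second it becomes fdsf|456|New_Name while the EFGH line keeps its single hyphen.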

Upvotes: 0

Claes Wikner

Reputation: 1517

awk 'NR>2{sub(/New./,"New_"); sub(/_..NAME/,"_NAME"); sub(/_.*Name/,"_Name")}1' file
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name

Upvotes: -1

Socowi

Reputation: 27215

I hope I got your requirements right:

  1. Replace groups of multiple - (e.g. ---) with _.
    (If that's a typo in your example, simply remove the sed line in this answer.)
  2. Replace all symbols other than letters, numbers, |, and - with _.
  3. Squeeze repeated - and _ (e.g. ----).
  4. Remove leading underscores in every |-separated field.

The following script implements these requirements in the same order (the first line handles the first requirement, and so on). Note that tr is not line-based and treats newline characters like any other character, so we have to tell tr explicitly to keep the newline character \n. Also note that - has to be escaped in tr's arguments.

f() {
     sed 's/---*/_/g' |
     tr -c  '[:alnum:]|\-\n' _ |
     tr -s  '\-_' |
     sed -E 's/(^|\|)_/\1/g'
}

Use this function like

f  <infile  >outfile
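As a quick check without creating a file, the function can also be fed a few lines through printf. This is only a sketch: the function body is repeated so the snippet runs on its own, the three input lines are copied from the question, and it assumes a tr that accepts \- as a literal hyphen (GNU tr does).

```shell
f() {
     sed 's/---*/_/g' |            # 1. runs of two or more '-' become '_'
     tr -c  '[:alnum:]|\-\n' _ |   # 2. everything but alnum, '|', '-', newline becomes '_'
     tr -s  '\-_' |                # 3. squeeze repeated '-' and '_'
     sed -E 's/(^|\|)_/\1/g'       # 4. drop a leading '_' in each field
}

printf '%s\n' 'EFGH|456|New-Name' 'fdsf|456|New-----Name' 'iouk|456|New(#$%^)_Name' | f
```

The first line should pass through untouched, and the other two should both end in New_Name.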

Upvotes: 3

potong

Reputation: 58391

This might work for you (GNU sed):

sed -E 's/[^[:alnum:]|_.,*=/-]//g;s/[*=/]+/_/g;s/--+|__+/_/g' file

The first substitution removes any unwanted characters.

The second substitution replaces one or more *, = or / with a single _ throughout the file.

The third substitution replaces two or more - or _ with a single _ throughout the file.

N.B. The alternation metacharacter | and the substitution delimiter / are taken literally inside a bracket expression, so sed -E 's/[/|]//g' file will remove all occurrences of / and |. Also, a - within a bracket expression can denote a range: [a-zA-Z0-9] means any single alphanumeric character, equivalent to [[:alnum:]]. If the - is placed just before the closing bracket, however, it is taken literally, so sed 's/[a-]//g' file will remove all occurrences of a and -.

The final substitution could be amended to s/(-)-+|(_)_+/\1\2/g which is equivalent to s/--+/-/g;s/__+/_/g if the user so wishes to shorten those extraneous characters.
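The bracket-expression rules and the amended final substitution can be tried on throwaway input. This is a small sketch assuming GNU sed; the test strings and the helper name shorten_runs are invented for the demonstration.

```shell
# '/' and '|' inside brackets are literal, so this deletes both of them.
printf '%s\n' 'a/b|c' | sed -E 's/[/|]//g'

# A '-' placed just before the closing bracket is literal, not a range.
printf '%s\n' 'a-b-a' | sed 's/[a-]//g'

# The amended final substitution shortens each run but keeps its character.
shorten_runs() {
    sed -E 's/(-)-+|(_)_+/\1\2/g'
}
printf '%s\n' 'x---y___z' | shorten_runs
```

With GNU sed these three commands print abc, b and x-y_z respectively.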

Upvotes: 0

Ed Morton

Reputation: 203413

It sounds like by "special character" you mean non-alphanumeric. If so then just use the negation of the [:alnum:] character class to match those chars, e.g. with any awk in any shell on every UNIX box and only changing column 3 since you said "I need to be looking at specific column":

$ awk 'BEGIN{FS=OFS="|"} {gsub(/[^[:alnum:]-]+|--+/,"_",$3)} 1' file
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name

If [^[:alnum:]-] is wrong then just use whatever character class you want and/or list the specific chars to replace, e.g. [*\/%]. Note that you don't need to handle | explicitly in the regexps since there can't be a | in a |-separated field.
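As one example of swapping in a different character class, the sketch below also keeps . (which the question lists among the characters to keep) and takes the column number as an argument. The wrapper clean_col and its argument order are hypothetical, the second input line is made-up data, and it assumes an awk that supports POSIX character classes.

```shell
# clean_col COL [FILE...]  (hypothetical helper)
# COL is the 1-based field to clean; input comes from FILE or stdin.
clean_col() {
    col=$1
    shift
    # Keep alphanumerics, '.' and single '-'; anything else, and runs of
    # two or more '-', collapse to one '_' in the chosen field only.
    awk -v c="$col" 'BEGIN{FS=OFS="|"} {gsub(/[^[:alnum:].-]+|--+/,"_",$c)} 1' "$@"
}

printf '%s\n' 'iouk|456|New(#$%^)_Name' 'abcd|1.5|v1.2-beta' | clean_col 3
```

The first line should come out as iouk|456|New_Name and the second should be left unchanged; other columns are never touched, whatever they contain.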

Upvotes: 3

Jotne

Reputation: 41456

Why not just something like this:

sed -E 's/[*/_%=#()^$]+|-+/_/g' file
ABCD|123|Name
EFGH|456|New_Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name

Upvotes: 1
