Reputation: 93
I want to remove all special characters except |, _, - and . from a pipe-separated file.
For example, my data file looks like this:
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New*name
QRST|124|New/name
UVWX|353|Name_*%NAME
EFGH|456|New=Name
Eaba|456|New****Name
fdsf|456|New-----Name
iouk|456|New(#$%^)_Name
I have tried the commands below, but I'm only halfway there.
tr -cd '[:print:]' < temp.txt > newfile -- I still get all the special chars.
tr -cd '[:alnum:]' <temp.txt -- I get only alphanumeric chars but I want to keep a few special chars.
cat temp.txt | sed 's/[a-zA-Z0-9|_-.]//g' | sed '/^$/d' -- I get all the special chars but repetition is there
The below gives me the output as
$ cat temp.txt | sed 's/[a-zA-Z0-9|_-.]//g' | sed '/^$/d' | tr -cd '[:print:]' | sort -u
""""){***+#=**~>>\+*****<(")
If I at least get all the unique special characters, I'll be able to put everything into a sed and replace with null.
My expected output is:
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name
I only need to look at a specific column, if that helps reduce the code. As said earlier, the code has to keep the |, _ and - characters and remove everything else. Let me know if you need any more info.
Upvotes: 2
Views: 3204
Reputation: 133
This should do the trick:
sed -r -e 's#([^a-zA-Z0-9\|_])+#_#g' -e 's/_+/_/g'
However, there are some inconsistencies between your expected output and your stated goals.
In particular, you state that you want to keep hyphens, but you kept the one on the EFGH line and removed them from the fdsf line.
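If the intent is to keep single hyphens and turn only runs of two or more into an underscore (which is what the expected output suggests), a variant along these lines might work. This is only a sketch checked against the sample data; the function name clean is made up for the example, and the reading of the requirements is an assumption.

```shell
# clean: hypothetical helper name; reads stdin, writes stdout.
clean() {
  # First expression: replace runs of disallowed characters, or runs of
  # two or more hyphens, with a single underscore.
  # Second expression: collapse any resulting run of underscores.
  sed -E -e 's/[^a-zA-Z0-9|_-]+|--+/_/g' -e 's/_+/_/g'
}

printf '%s\n' 'EFGH|456|New-Name' 'fdsf|456|New-----Name' 'iouk|456|New(#$%^)_Name' | clean
# EFGH|456|New-Name
# fdsf|456|New_Name
# iouk|456|New_Name
```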
Upvotes: 0
Reputation: 1517
awk 'NR>2{sub(/New./,"New_")sub(/_..NAME/,"_NAME")sub(/_.*Name/,"_Name")}1' file
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name
Upvotes: -1
Reputation: 27215
I hope I got your requirements right:
1. Replace sequences of two or more - (e.g. ---) with _. (This is handled by the first sed line in this answer.)
2. Replace every special character other than | and - with _.
3. Squeeze repeated - and _ (e.g. --- → -).
4. Remove a leading _ from each |-separated field.
The following script implements these requirements in the same order (the first line handles the first requirement, and so on). Note that tr is not line-based and treats newline characters like any other character, so we have to tell tr explicitly to keep the newline character \n. Also note that - has to be escaped in tr's arguments.
f() {
sed 's/---*/_/g' |
tr -c '[:alnum:]|\-\n' _ |
tr -s '\-_' |
sed -E 's/(^|\|)_/\1/g'
}
Use this function like
f <infile >outfile
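As a quick check, the pipeline can be fed a few sample lines directly. The function body is repeated here only so the snippet runs on its own; the third input line is a made-up example (not from the question) that shows the leading underscore being removed.

```shell
# Same pipeline as f above, repeated so this snippet is self-contained.
f() {
  sed 's/---*/_/g' |
  tr -c '[:alnum:]|\-\n' _ |
  tr -s '\-_' |
  sed -E 's/(^|\|)_/\1/g'
}

printf '%s\n' 'UVWX|353|Name_*%NAME' 'fdsf|456|New-----Name' 'QRST|124|*name' | f
# UVWX|353|Name_NAME
# fdsf|456|New_Name
# QRST|124|name
```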
Upvotes: 3
Reputation: 58391
This might work for you (GNU sed):
sed -E 's/[^[:alnum:]|_.,*=/-]//g;s/[*=/]+/_/g;s/--+|__+/_/g' file
The first substitution removes any unwanted characters.
The second substitution replaces one or more *, = or / with a single _ throughout the file.
The third substitution replaces two or more - or _ with a single _ throughout the file.
N.B. The alternation metacharacter | and the substitution delimiter / represent their literal values inside a bracket expression, so sed -E 's/[/|]//g' file will remove all occurrences of / and |. Also, a - within a bracket expression can represent a range: [a-zA-Z0-9] means any single alphanumeric character, equivalent to [[:alnum:]]. If it is placed just before the closing bracket, however, it represents its literal value, so sed 's/[a-]//g' file will remove all occurrences of a and -.
The final substitution could be amended to s/(-)-+|(_)_+/\1\2/g, which is equivalent to s/--+/-/g;s/__+/_/g, if the user would rather shorten those runs to a single character of the same kind.
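The bracket-expression rules and the amended substitution can be tried on small inputs. This is a minimal sketch assuming GNU sed; the input strings are made up for illustration.

```shell
# / and | are literal inside a bracket expression, even with -E and / as the delimiter.
printf 'a/b|c-d\n' | sed -E 's/[/|]//g'
# abc-d

# A - placed just before the closing bracket is literal, not a range.
printf 'a-b-a\n' | sed 's/[a-]//g'
# b

# Amended final substitution: runs of - become one -, runs of _ become one _.
printf 'New-----Name__x\n' | sed -E 's/(-)-+|(_)_+/\1\2/g'
# New-Name_x
```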
Upvotes: 0
Reputation: 203413
It sounds like by "special character" you mean non-alphanumeric. If so, just use the negation of the [:alnum:] character class to match those chars, e.g. with any awk in any shell on every UNIX box, changing only column 3 since you said "I need to be looking at specific column":
$ awk 'BEGIN{FS=OFS="|"} {gsub(/[^[:alnum:]-]+|--+/,"_",$3)} 1' file
ABCD|123|Name
EFGH|456|New-Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name
If [^[:alnum:]-] is wrong then just use whatever character class you want and/or list the specific chars, e.g. [^*\/%-]. Note that you don't need to handle | explicitly in the regexps since there can't be a | in a |-separated field.
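If the column to clean varies, the field number could be passed in rather than hard-coded. The sketch below wraps the same gsub in a shell function; the names clean_col and col are made up for this example.

```shell
# clean_col N: hypothetical wrapper; cleans only field N of |-separated input on stdin.
clean_col() {
  awk -v col="$1" 'BEGIN{FS=OFS="|"} {gsub(/[^[:alnum:]-]+|--+/,"_",$col)} 1'
}

printf '%s\n' 'UVWX|353|Name_*%NAME' 'fdsf|456|New-----Name' | clean_col 3
# UVWX|353|Name_NAME
# fdsf|456|New_Name
```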
Upvotes: 3
Reputation: 41456
Why not just something like this:
sed -E 's/[*/_%=#()^$]+|-+/_/g' file
ABCD|123|Name
EFGH|456|New_Name
IJKL|789|New_Name
MNOP|123|New_name
QRST|124|New_name
UVWX|353|Name_NAME
EFGH|456|New_Name
Eaba|456|New_Name
fdsf|456|New_Name
iouk|456|New_Name
Upvotes: 1