Reputation: 11
I am trying to convert this input from file.txt
a,b;c^d"e}
f;g,h!;i8j-
into this output
a,b,c,d,e,f,g,h,i,j
with awk
The best I have managed so far is
awk '$1=$1' FS="[!;^}8-]" OFS="," file.txt
Two things remain unsolved:
How do I treat " as a special character? Writing " doesn't work.
How do I avoid ,, in the output and delete the last , ?
Upvotes: 1
Views: 107
Reputation: 770
If you are OK with a Perl solution, here is a one-liner:
perl -ne '$_ =~ s/[^[:alnum:]]//g; print join(",", split//, $_)'
which outputs:
a,b,c,d,ef,g,h,i,8,j
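Simply put, this substitutes every character that is not alphanumeric with nothing, then joins the remaining characters with commas. Note that the digit 8 counts as alphanumeric and is kept, and because -n processes one line at a time with the newline stripped, no comma is placed between e and f.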
Upvotes: 0
Reputation: 2895
echo "${input_data}" |
mawk 'NF-=_==$NF' FS='[^[:alpha:]]*' OFS=, RS=
a,b,c,d,e,f,g,h,i,j
If there is a possibility of leading separators, use this instead:
echo ']a[' |
gawk 'gsub("^,|,$",_,$!(NF=NF))^_' FS='[^[:alpha:]]*' OFS=, RS=
a
** Side note: beware that nawk has an unconventional definition of what it considers [[:alpha:]]:
reparse <[[:alpha:]]+>
cclenter : in = | . .. |, out =
|ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
ªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ
ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ|
Even though the locale is set to LANG="en_US.UTF-8", nawk's idea of [[:alpha:]] is neither ASCII nor full Unicode, but something resembling, though not necessarily identical to, a legacy 8-bit locale like ISO-8859-...
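One way to sidestep the locale question, as a sketch: force the C locale and spell out the ASCII letters explicitly instead of relying on [[:alpha:]]. The function name ascii_letters_csv is made up for illustration, and this assumes that only ASCII letters should count as fields.

```shell
# Hypothetical helper: join all ASCII letters of the input with commas,
# independent of how a given awk interprets [[:alpha:]].
ascii_letters_csv() {
  LC_ALL=C awk '
    { gsub(/[^A-Za-z]+/, ","); out = out "," $0 }
    END {
      gsub(/,+/, ",", out)
      gsub(/^,|,$/, "", out)
      print out
    }
  ' "$@"
}

printf 'a,b;c^d"e}\nf;g,h!;i8j-\n' | ascii_letters_csv
# prints: a,b,c,d,e,f,g,h,i,j
```

Each line is reduced to letters separated by commas, the lines are chained together, and the END block squeezes repeated commas and trims the ones at both ends.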
Upvotes: 0
Reputation: 185590
KISS:
$ grep -o '[a-z]' file | paste -sd ',' -
a,b,c,d,e,f,g,h,i,j
Should work on most GNU/Linux systems, even busybox and FreeBSD (where the trailing - is mandatory).
Upvotes: 1
Reputation: 37464
One in awk (not for all awks: tested successfully in gawk, mawk, busybox awk and macOS awk version 20200816, unsuccessfully in Debian's awk version 20121220 aka original-awk. There are limitations in locales as well.)
$ awk -v RS="^$" '{ # read whole file in
gsub(/[^a-z]+/,",") # replace all non lowercase alphabet substrings with a comma
sub(/,$/,"") # remove trailing comma
}1' file # output
Output:
a,b,c,d,e,f,g,h,i,j
Upvotes: 2
Reputation: 7831
If ed is available/acceptable, here is the script, script.ed:
%s/[^a-z]/ /g
%s/[[:blank:]]\{1,\}/,/g
g/./;j\
s/,$//
,p
Q
Now run
ed -s file.txt < script.ed
Upvotes: 1
Reputation: 204381
Using any POSIX awk and assuming you want any non-alphabetic character to act as a field separator:
$ awk -F '[^[:alpha:]]+' -v OFS=',' '{printf "%s", p; $1=$1; p=$0} END{sub(OFS"$","",p); print p}' file
a,b,c,d,e,f,g,h,i,j
If you really do just want to use the specific set of characters in your question as the field separators then just change [^[:alpha:]]+ to [!;^}8"-]+
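Spelled out, that second variant might look like the sketch below. The function name split_on_listed is made up for illustration. Because the -F argument is single-quoted for the shell, the " can sit inside the bracket expression without any escaping, and the trailing + collapses runs such as !; into a single separator, so no ,, appears in the output.

```shell
# Hypothetical helper: split only on the characters listed in the question.
split_on_listed() {
  awk -F '[!;^}8"-]+' -v OFS=',' '
    { printf "%s", p; $1 = $1; p = $0 }     # rebuild each line with OFS, print it one step late
    END { sub(OFS "$", "", p); print p }    # drop the trailing comma from the last line
  ' "$@"
}

printf 'a,b;c^d"e}\nf;g,h!;i8j-\n' | split_on_listed
# prints: a,b,c,d,e,f,g,h,i,j
```

The commas already present in the input are not separators here; they pass through unchanged and happen to match OFS.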
Upvotes: 2
Reputation: 163527
Using GNU sed, replace one or more characters other than a-z with a comma, then remove all leading and trailing commas:
sed -Ez 's/[^a-z]+/,/g; s/^,+|,+$//g' file
Output
a,b,c,d,e,f,g,h,i,j
Upvotes: 0
Reputation: 265687
If you only want to replace non-letter characters with commas and squeeze repeated commas, tr is your friend:
tr -sc '[:alpha:]' ','
This will leave a trailing comma though. You could use sed to remove/replace it:
tr -sc '[:alpha:]' ',' | sed 's/,$/\n/'
Another possibility is to split each "item" into its own line (with tr or grep -o), then use paste to combine the lines again:
tr -sc '[:alpha:]' '\n' | paste -sd,
Upvotes: 2
Reputation: 36700
I would harness GNU AWK for this task in the following way. Let the content of file.txt be
a,b;c^d"e} f;g,h!;i8j-
then
awk 'BEGIN{FPAT="[a-z]";OFS=","}{$1=$1;print}' file.txt
gives output
a,b,c,d,e,f,g,h,i,j
Explanation: I inform GNU AWK, using FPAT, that a field is a single lowercase ASCII letter, and that the output field separator (OFS) is a comma. Then for each line I do $1=$1 to trigger a rebuild of the line and print it.
(tested in GNU Awk 5.0.1)
Upvotes: 2