Reputation: 21
I need to use a dictionary database, but most of it is useless alphanumeric data, and the interesting fields are either non-alphanumeric (such as Chinese characters) or enclosed in curly brackets. I searched a lot and learned about tools like sed, awk, grep, etc. I even thought about writing a Python script to sort it out, but I never managed to find a solution.
A line of the database looks like this:
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
I need it to look like this:
助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
How can I do this using any of the tools mentioned above?
Upvotes: 1
Views: 152
Reputation: 316
Using shell script (Bash):
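#!/bin/bash
string="助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"
# Start with a fresh temporary file (its blank first line is dropped by awk 'NF' below)
echo "" > tmpfield
# $string is deliberately left unquoted so the shell splits it into fields
for field in $string
do
    if [ "${field:0:1}" != "{" ]; then
        # Not a {bracketed} field: blank out ASCII letters, digits, '.' and '-'
        # (note that this also replaces the '.' inside the kana readings)
        echo "$field" | sed "s/[a-zA-Z0-9 .-]/ /g" >> tmpfield
    else
        # Bracketed fields are kept as they are
        echo "$field" >> tmpfield
    fi
done
# Drop blank lines, then join the remaining rows into a single line
awk 'NF' tmpfield | awk 'BEGIN { ORS = " " } { print }'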
My output:
nampt@nampt-desktop:/mnt$ bash 1.bash
助 ジョ たす ける たす かる す ける すけ {help} {rescue} {assist}
Upvotes: 0
Reputation: 96171
Here is a Python solution if you would still like one:
import re
alpha_brack = re.compile(r"([a-zA-Z0-9.\-]+)|({.*?})")
my_string = """
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367
DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4
Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"""
match = alpha_brack.findall(my_string)
new_string = my_string
for g0, _ in match: # only care about the first group!
    new_string = new_string.replace(g0, '', 1) # replace only the first occurrence!
final = re.sub(r'\s{2,}',' ', new_string) # finally, clean up whitespace
print(final)
My results:
'助ジョ たすける たすかる すける すけ {help} {rescue} {assist}'
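If the dots inside the readings (たす.ける) should survive, as in the output the question asks for, one possible variation is to filter whole whitespace-separated fields instead of replacing characters. This is only a sketch: the name strip_codes is made up for illustration, and it assumes that every {bracketed} field is a single word, as in the sample line.

```python
import re

# A field counts as a "code" when it consists only of ASCII letters,
# digits, '.' and '-' (assumption based on the sample line).
CODE_TOKEN = re.compile(r"[A-Za-z0-9.\-]+")

def strip_codes(line: str) -> str:
    """Drop whole code fields and keep every other field untouched."""
    kept = [tok for tok in line.split() if not CODE_TOKEN.fullmatch(tok)]
    return " ".join(kept)

sample = ("助 L1782 DN1921 K407 P1-5-2 I2g5.1 Q7412.7 Yzhu4 Wjo "
          "ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}")
print(strip_codes(sample))
# 助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
```

Because a field is only dropped when the whole field matches, a '.' inside a kana reading is never touched. A multi-word bracket such as {test results found} would lose its middle word, so this needs extra care if such entries exist.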
Upvotes: 1
Reputation: 20980
Using perl:
perl -ne '
    m/(.*?)(\{.*)/;             # Split the line at the first {
    my $a=$1; my $b=$2;
    $a =~ s/[[:alnum:].-]//g;   # Remove letters, digits, dots and hyphens (add more characters as you need)
    $a =~ s/ +/ /g;             # Compress spaces
    print "$a$b\n";             # Print both parts (the first already ends with a space) and a newline
' dbfile.txt
Explanation in the inline comments.
Similar logic with sed:
sed '
h; #Save line in hold space.
s/{.*//; # Remove 2nd part
s/[a-zA-Z0-9.-]//g; # Remove all alphabets, numbers, . & -
s/ \{2,\}/ /g; # Compress runs of two or more spaces into one
x; #Save updated 1st part in hold space, take back the complete line in pattern space
s/[^{]*{/{/; #Remove first part
x; #Swap hold & pattern space again.
G; # Append 2nd part to first part separated by newline
s/\n//; # Remove newline.
' dbfile.txt
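For comparison, the same split-at-the-first-brace logic can be sketched in Python. The function name clean_before_brace is hypothetical, and like the perl and sed versions it also removes the '.' inside the readings.

```python
import re

def clean_before_brace(line: str) -> str:
    # Split at the first '{'; everything from there on is kept verbatim.
    head, brace, tail = line.partition("{")
    # Remove ASCII letters, digits, '.' and '-' from the first part only.
    head = re.sub(r"[A-Za-z0-9.\-]", "", head)
    # Compress the runs of spaces left behind by the removal.
    head = re.sub(r" +", " ", head)
    return head + brace + tail

sample = "助 L1782 P1-5-2 I2g5.1 Wjo ジョ たす.ける すけ {help} {rescue}"
print(clean_before_brace(sample))
# 助 ジョ たすける すけ {help} {rescue}
```

If a line contains no '{' at all, str.partition returns the whole line as the first part, so the entire line is cleaned.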
Upvotes: 0
Reputation: 7952
Personally, given your example line, I'd use sed to remove every run of alphanumeric characters that starts and ends with a space:
sed -E -i 's/ [a-zA-Z0-9 .-]+ / /g' dbfile.txt
That should be close to what you need. The -E flag is needed so that + means "one or more", and -i edits the file in place, so it needs a file name (dbfile.txt here is a placeholder). You may have to add more special characters if the text you're wiping out contains other things. Each match is replaced with a single space, which essentially deletes it.
I have no Linux box handy to verify this one, so it may require a little massaging.
Also worth mentioning: this will not work if the brackets can contain two spaces, as in {test results found},
because it will blow away the word between them.
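To make the caveat concrete, here is a small Python sketch that applies the same pattern to a made-up line, first naively and then with the {bracketed} groups set aside before the substitution. The names naive_clean and protected_clean are hypothetical, and the sketch assumes brackets are never nested.

```python
import re

# Same idea as the sed expression: a space-delimited run of code characters.
NAIVE = re.compile(r" [A-Za-z0-9 .\-]+ ")
# A {bracketed} group; the capturing parentheses make re.split keep it.
BRACED = re.compile(r"(\{[^}]*\})")

def naive_clean(line: str) -> str:
    return NAIVE.sub(" ", line)

def protected_clean(line: str) -> str:
    parts = BRACED.split(line)
    # Odd indexes hold the captured {…} groups, which are left untouched.
    cleaned = [p if i % 2 else NAIVE.sub(" ", p) for i, p in enumerate(parts)]
    return "".join(cleaned)

sample = "助 L1782 Wjo ジョ すけ {test results found} {help}"
print(naive_clean(sample))      # 助 ジョ すけ {test found} {help}
print(protected_clean(sample))  # 助 ジョ すけ {test results found} {help}
```

Splitting on the bracketed groups first means the substitution only ever sees the text outside the brackets.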
Upvotes: 0