Reputation: 21

How to delete alphanumeric words out of a Unicode file

I need to use a dictionary database, but most of it is useless alphanumeric data, and the interesting fields are either non-alphanumeric (such as Chinese characters) or enclosed in curly brackets. I searched a lot and learned about tools like sed, awk, grep, etc. I even thought about writing a Python script to sort it out, but I never managed to find a solution.

A line of the database looks like this:

助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}

I need it to look like this:

助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}

How can I do this using any of the tools mentioned above?

Upvotes: 1

Answers (4)

Nam Pham

Reputation: 316

Using a shell script (Bash):

#!/bin/bash

string="助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"

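# Start with an empty scratch file.
: > tmpfield

# $string is left unquoted on purpose so that it is split on whitespace.
for field in $string
do
    if [ "${field:0:1}" != "{" ]; then
        # Not a {gloss}: blank out ASCII letters, digits, '.' and '-'.
        # Note that this also turns the '.' inside the readings into a space.
        echo "$field" | sed "s/[a-zA-Z0-9 .-]/ /g" >> tmpfield
    else
        # Keep {gloss} fields untouched.
        echo "$field" >> tmpfield
    fi
done

# Drop the blank rows, then join the remaining rows into a single line.
awk 'NF' tmpfield | awk 'BEGIN { ORS = " " } { print }'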

My output:

nampt@nampt-desktop:/mnt$ bash 1.bash 
  助 ジョ たす ける たす かる す ける すけ {help} {rescue} {assist}

Upvotes: 0

juanpa.arrivillaga

Reputation: 96171

Here is a Python solution if you would still like one:

import re
alpha_brack = re.compile(r"([a-zA-Z0-9.\-]+)|({.*?})")

my_string = """
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 
DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 
Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"""

# Group 1 (ASCII letters, digits, '.', '-') is dropped;
# group 2 (a complete {gloss}) is put back unchanged.
new_string = alpha_brack.sub(lambda m: m.group(2) or '', my_string)

final = re.sub(r'\s{2,}', ' ', new_string) # finally, clean up whitespace

print(final)

My results (preceded by a blank line, because my_string starts with a newline):

助 ジョ たすける たすかる すける すけ {help} {rescue} {assist}
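
If the dots inside the kana readings should survive, as in the output the question asks for, one option is to judge each whitespace-separated field as a whole instead of deleting individual characters. What follows is only a minimal sketch: it assumes fields never contain spaces (so a multi-word entry such as {test results found} is not handled), the names ASCII_FIELD and drop_ascii_fields are made up for illustration, and the sample line is a shortened version of the one in the question.

```python
import re

# A field made up only of ASCII letters, digits, '.' and '-'.
ASCII_FIELD = re.compile(r"[A-Za-z0-9.-]+")


def drop_ascii_fields(line):
    """Return the line without the fields that are entirely ASCII letters, digits, '.' or '-'."""
    kept = [field for field in line.split() if not ASCII_FIELD.fullmatch(field)]
    return " ".join(kept)


# Shortened sample line (an assumption; the real lines carry more codes).
line = "助 L1782 DN1921 MP2.0376 P1-5-2 I2g5.1 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"
print(drop_ascii_fields(line))
# 助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
```

Because the pattern has to match a whole field, readings such as たす.ける and bracketed entries such as {help} are never touched.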

Upvotes: 1

anishsane

Reputation: 20980

Using perl:

perl -ne '
    chomp;
    my ($head, $glosses) = m/^([^{]*)(.*)/; # Split at the first "{" (the second part may be empty).
    $head =~ s/[[:alnum:].-]//g; # Remove letters, digits, "." and "-" (add more characters as you need).
    $head =~ s/ +/ /g; # Compress runs of spaces.
    print "$head$glosses\n"; # Print both parts and a newline.
' dbfile.txt

Explanation in the inline comments.

Similar logic with sed:

sed '
     h; #Save line in hold space.
     s/{.*//; # Remove 2nd part
     s/[a-zA-Z0-9.-]//g; # Remove all alphabets, numbers, . & -
     s/  */ /g; # Compress spaces
     x; #Save updated 1st part in hold space, take back the complete line in pattern space
     s/[^{]*{/{/; #Remove first part
     x; #Swap hold & pattern space again.
     G; # Append 2nd part to first part separated by newline
     s/\n//; # Remove newline.
     ' dbfile.txt
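
The same split-at-the-first-bracket idea can be sketched in Python as well. This is an illustration rather than a drop-in replacement: clean_line and clean_stream are hypothetical names, the input is assumed to be already decoded text, and unlike the perl and sed versions it drops whole ASCII fields, so the dots inside the readings are kept.

```python
import io
import re

# A field made up only of ASCII letters, digits, '.' and '-'.
ASCII_FIELD = re.compile(r"[A-Za-z0-9.-]+")


def clean_line(line):
    """Filter the fields before the first '{' and leave the bracketed part untouched."""
    head, brace, glosses = line.rstrip("\n").partition("{")
    kept = [field for field in head.split() if not ASCII_FIELD.fullmatch(field)]
    if brace:
        # Re-attach everything from the first '{' onwards, spaces included.
        kept.append(brace + glosses)
    return " ".join(kept)


def clean_stream(lines):
    """Yield the cleaned version of every line of an iterable of lines."""
    for line in lines:
        yield clean_line(line)


# Invented sample input standing in for the real file.
sample = io.StringIO("助 L1782 Yzhu4 Wjo ジョ たす.ける {help} {to rescue}\n")
for cleaned in clean_stream(sample):
    print(cleaned)
# 助 ジョ たす.ける {help} {to rescue}
```

For a real file, an open file object can be passed to clean_stream in place of the StringIO; which encoding to open it with depends on the file and is not stated in the question.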

Upvotes: 0

TemporalWolf

Reputation: 7952

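Personally, given your example line, I'd use sed to remove every run of alphanumeric fields that starts and ends with a space:

sed -E -i 's/ [a-zA-Z0-9 .-]+ / /g' dbfile.txt

This should be close to what you need (-E is required for the + quantifier, and dbfile.txt stands in for your file). You may have to add more special characters to the bracket expression if the text you're wiping out contains other things. With -i the file is edited in place, and each matched run is replaced by a single space, which essentially deletes it.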

No linux box handy to verify this one... it may require a little massaging.

Also worth mentioning: this will not work if a bracketed entry can contain two spaces, e.g. {test results found}, as it will blow away the word "results".
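
To make that caveat concrete, here is a small Python sketch that applies the same pattern with re.sub. Python's re module is used only as a stand-in for sed -E, and the sample line is invented. Restricting the substitution to the text before the first { is one possible way around the problem.

```python
import re

# The same expression as in the sed command above.
PATTERN = re.compile(r" [a-zA-Z0-9 .-]+ ")

# Invented sample line with a bracketed entry that contains two spaces.
line = "助 K407 ジョ {test results found}"

# Applied to the whole line, the middle word of the bracketed entry is lost.
print(PATTERN.sub(" ", line))
# 助 ジョ {test found}

# Applied only to the part before the first '{', the entry survives.
head, brace, glosses = line.partition("{")
print(PATTERN.sub(" ", head) + brace + glosses)
# 助 ジョ {test results found}
```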

Upvotes: 0
