Reputation: 557

regular expression to remove unwanted text from string

I am trying to extract just a few pieces of information from a big string like this:

[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]

I want to extract strings like this:

좋은 - good
좋은 - good,nice,pretty,admirable,canny,tenacious (basically adjectives)
훌륭하게 - wonderfully,good,nicely,beautifully,fine,finely (adverbs)

Please help. I tried piping cut into sed, like this:

cut --delimiter='"' -f 1-2 and then use sed 's/\[\[\[\"//'

This gives me only the first Korean word, 좋은, and I have not been able to extend it to get the desired result. If there is a better way to achieve this, please suggest it. Thanks in advance.

Upvotes: 2

Answers (2)

Tensibai

Reputation: 15784

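A little late, but here is a pure-regex solution. Note that the lazy quantifiers (.*?) need a PCRE-style engine such as perl; sed's POSIX regular expressions do not support them.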

regex: \[\[\["(.*?)","(.*?)"\]\],\[\["(.*?)",\[\["(.*?)",\["(.*?)"\],.*?\]\],.*?\],\["(.*?)",\["(.*?)",\["(.*)"\],.*\]\]\]

Substitution: \1 - \2\n\4 - \5 (\3)\n\7 - \8 (\6)

demo

This assumes the original line always contains both the adjective and the adverb brackets, even if they are empty.

See the substitution in the demo for how the matches are reorganized.
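
As a rough illustration of how that substitution reorders the capture groups, here is a minimal Ruby sketch. The pattern is copied unchanged from above; the names SAMPLE_LINE and reorganise are made up for this example. It also strips the double quotes afterwards, which the substitution alone does not do, so treat that step as an addition rather than part of the original answer.

```ruby
# Sample line from the question.
SAMPLE_LINE = '[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]'

# The pattern from the answer, unchanged.
PATTERN = /\[\[\["(.*?)","(.*?)"\]\],\[\["(.*?)",\[\["(.*?)",\["(.*?)"\],.*?\]\],.*?\],\["(.*?)",\["(.*?)",\["(.*)"\],.*\]\]\]/

# Hypothetical helper: mirrors the substitution
#   \1 - \2\n\4 - \5 (\3)\n\7 - \8 (\6)
# and then removes the double quotes left inside the captured lists.
def reorganise(line)
  m = PATTERN.match(line)
  return nil unless m

  [
    "#{m[1]} - #{m[2]}",
    "#{m[4]} - #{m[5]} (#{m[3]})",
    "#{m[7]} - #{m[8]} (#{m[6]})"
  ].join("\n").delete('"')
end

puts reorganise(SAMPLE_LINE)
```

Because the second group runs up to the closing brackets of the first block, the first output line keeps the romanization (joh-eun) as well as the translation.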

Upvotes: 2

glenn jackman

Reputation: 246774

Here's a piece of Ruby, but probably any PCRE-equipped tool can do something similar:

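ruby -ne '
    # Drop every double quote, then capture each run of Hangul together with
    # the comma-separated list that follows it, up to the next closing bracket.
    $_.gsub(/"/,"")
      .scan(/ (\p{Hangul}+) ,\[? (.+?) \] /x) {|m| puts m[0] + " - " + m[1]}
' <<END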
[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]
END

좋은 - good,joh-eun,
좋은 - good,nice,pretty,admirable,canny,tenacious
훌륭하게 - wonderfully,good,nicely,beautifully,fine,finely

Too bad the original text isn't in easier to handle JSON.
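
That said, in the sample line the only thing that keeps the text from being valid JSON appears to be the empty array slots (the ",," runs). The following is a minimal sketch, assuming that holds for every input line and that ",," never occurs inside a quoted string: it fills those slots with null and hands the result to Ruby's standard json library. The names SAMPLE and extract_pairs are made up for this example.

```ruby
require "json"

# Sample line from the question.
SAMPLE = '[[["좋은","good","joh-eun",""]],[["adjective",[["좋은",["good","nice","pretty","admirable","canny","tenacious"],,0.38553435]],"good",4],["adverb",["훌륭하게",["wonderfully","good","nicely","beautifully","fine","finely"],,0.00029145498],"good",4]]]'

# Hypothetical helper: returns one "word - translations" string per entry.
def extract_pairs(raw)
  # Fill empty slots: every comma directly followed by another comma gets
  # a null after it. Assumption: ",," never appears inside a quoted string.
  data = JSON.parse(raw.gsub(/,(?=,)/, ",null"))

  # First block: [[word, translation, romanization, ...]]
  head = data[0][0]
  lines = ["#{head[0]} - #{head[1]}"]

  # Second block: one element per part of speech.
  data[1].each do |_part_of_speech, entries, *_rest|
    # In the sample the adjective entries are wrapped in an extra array
    # but the adverb entry is not, so normalise to a list of entries.
    entries = [entries] unless entries.first.is_a?(Array)
    entries.each do |word, synonyms, *_|
      lines << "#{word} - #{synonyms.join(',')}"
    end
  end

  lines
end

puts extract_pairs(SAMPLE)
```

Walking the parsed structure by position avoids having to guess where one list ends and the next begins, which is what makes the regex versions fragile.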

Thanks to this question for how to match Korean characters.

Upvotes: 1
