Jay Gray

Reputation: 1726

How to use awk to read a dictionary and replace words in a file?

We have a source file ("source-A") that looks like this (if you see blue text, it comes from stackoverflow, not the text file):

The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences

Each sentence in "source-A" is on its own line and terminates with a newline (\n).

We have a dictionary/conversion file ("converse-B") that looks like this:

aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes

"converse-B" is a two-column, tab-delimited file. Each equivalence map (term-on-left<tab>term-on-right) is on its own line and terminates with a newline (\n).

How can we read "converse-B", replace each term in "source-A" that matches column 1 of "converse-B" with the corresponding term in column 2, and write the result to an output file ("output-C")?

For example, the "output-C" would look like this:

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.

The tricky part is the term potato.

If a "simple" awk solution cannot handle a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.

In other words, an awk solution can stipulate that it only works for an unambiguous word or a term composed of space-separated, unambiguous words.

An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.

Upvotes: 0

Views: 548

Answers (1)

karakfa

Reputation: 67507

sed is probably a better fit, since this is only phrase/word replacement. Note that if the same word appears in more than one phrase, it is first come, first served: the substitutions run in dictionary order, so order your dictionary accordingly.

$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
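
To illustrate the first-come-first-served note, here is a small sketch with two orderings of the same dictionary. The "stag<tab>deer" entry and the file names are invented for the demo, and it assumes GNU sed (for \t in the pattern). It writes the generated script to a temporary file rather than using process substitution, so it also runs in plain sh.

```shell
workdir=$(mktemp -d)
printf 'No one drank at stag night.\n' > "$workdir/content"

# Hypothetical dictionaries: the "stag -> deer" entry is invented for this demo.
printf 'stag\tdeer\nstag night\tbachelor party\n' > "$workdir/short_first"
printf 'stag night\tbachelor party\nstag\tdeer\n' > "$workdir/long_first"

# usage: apply_dict DICT CONTENT
# Turns each "left<TAB>right" line into "s/left/right/g", then applies
# the resulting commands to CONTENT in dictionary order.
apply_dict() {
  sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$1" > "$workdir/script.sed"
  sed -f "$workdir/script.sed" "$2"
}

short_first_out=$(apply_dict "$workdir/short_first" "$workdir/content")
long_first_out=$(apply_dict "$workdir/long_first" "$workdir/content")
printf '%s\n%s\n' "$short_first_out" "$long_first_out"
# prints:
# No one drank at deer night.
# No one drank at bachelor party.

rm -r "$workdir"
```

With the short entry first, "stag" is consumed before the phrase "stag night" gets a chance to match, so putting longer phrases earlier in the dictionary is usually the safer order.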

The sed inside the process substitution converts each dictionary entry into a sed substitution command, and the main sed applies those commands to the content.
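
To see what the inner sed produces, it can be run on its own. This is a minimal sketch that assumes GNU sed (for \t in the pattern) and uses a two-entry sample dictionary in a temporary directory; the variable and file names are only for illustration.

```shell
# Build a two-entry sample dictionary (tab-separated).
workdir=$(mktemp -d)
printf 'aluminium\taluminum\nwhite spirit\tmineral spirits\n' > "$workdir/dict"

# Same inner command as above: each "left<TAB>right" line becomes
# one "s/left/right/g" substitution command.
generated=$(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$workdir/dict")
printf '%s\n' "$generated"
# prints:
# s/aluminium/aluminum/g
# s/white spirit/mineral spirits/g

rm -r "$workdir"
```

The underscore is used as the delimiter of the outer s command so that the slashes of the generated commands need no escaping.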

NB: A production-quality script should also handle letter case and word boundaries to avoid unwanted substring substitutions; both are ignored here.
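
As a rough sketch of the word-boundary part only, the generator can wrap each left-hand term in \b, which is a GNU sed extension, so this assumes GNU sed. Letter case is still not handled, and dictionary terms containing "/" or regex metacharacters would need escaping first. The sample sentence and file names are made up for the demo.

```shell
workdir=$(mktemp -d)
printf 'mashed potato\tmashed potatoes\npotato crisp\tpotato chip\n' > "$workdir/dict"
printf 'He likes mashed potatoes and a potato crisp.\n' > "$workdir/content"

# Generator from the answer: plain substring substitution.
sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$workdir/dict" > "$workdir/naive.sed"

# Variant: emit s/\bleft\b/right/g so only whole words or phrases match.
# "\\b" in the replacement text writes a literal "\b" into the script.
sed -E 's_(.+)\t(.+)_s/\\b\1\\b/\2/g_' "$workdir/dict" > "$workdir/bounded.sed"

naive_out=$(sed -f "$workdir/naive.sed" "$workdir/content")
bounded_out=$(sed -f "$workdir/bounded.sed" "$workdir/content")
printf '%s\n%s\n' "$naive_out" "$bounded_out"
# prints:
# He likes mashed potatoeses and a potato chip.
# He likes mashed potatoes and a potato chip.

rm -r "$workdir"
```

The bounded form leaves an already plural "mashed potatoes" alone, which is the singular/plural case the question was worried about.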

Upvotes: 1
