canadian_scholar

Reputation: 1325

Using sed to find-and-replace in a text file using strings from another text file

I have two files as follows. The first is sample.txt:

new haven co-op toronto on $1245
joe schmo co-op powell river bc $4444

The second is locations.txt:

toronto
powell river
on
bc

We'd like to use sed to produce a marked-up sample-new.txt in which every string from locations.txt that appears in sample.txt has a ; added before and after it, so that the final lines look like:

new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444

Is this possible using bash? The actual files are much longer (thousands of lines in each case) but as a one-time job we're not too concerned about processing time.

--- edited to add ---

My original approach was something like this:

cat locations.txt | xargs -i sed 's/{}/;/' sample.txt

But that runs sed once per pattern, each time against the original sample.txt, so the substitutions never accumulate in a single output, unlike the methods you've proposed here.

Upvotes: 3

Views: 65

Answers (2)

anishsane

Reputation: 20980

Using awk:

awk 'NR==FNR{a[NR]=$0; next;} {for(i in a)gsub("\\<"a[i]"\\>",";"a[i]";"); print} '  locations.txt sample.txt

Using awk + sed:

sed -f <(awk '{print "s|\\<"$0"\\>|;"$0";|g"}' locations.txt) sample.txt

Same using pure sed:

sed -f <(sed 's/.*/s|\\<&\\>|\;&\;|g/' locations.txt) sample.txt

(After you show your coding attempts, I will add the explanation of why this works.)
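
In the meantime, a rough sketch of the idea behind the two `sed -f` variants: locations.txt is first turned into a sed script with one substitution command per location, and that script is then run over sample.txt. The sketch below assumes GNU sed (the `\<` and `\>` word-boundary anchors are a GNU extension) and writes the generated script to a temporary file rather than using process substitution. The function name `mark_locations` is made up for illustration.

```shell
#!/bin/sh
# mark_locations LOCATIONS_FILE SAMPLE_FILE
# Hypothetical helper name; assumes GNU sed for \< and \>.
mark_locations() {
    script=$(mktemp) || return 1
    # Turn each location line into a substitution command, e.g.
    #   toronto  ->  s|\<toronto\>|;toronto;|g
    sed 's/.*/s|\\<&\\>|;&;|g/' "$1" > "$script"
    # Apply every generated command to each line of the sample file.
    sed -f "$script" "$2"
    rm -f "$script"
}
```

It could be called as `mark_locations locations.txt sample.txt > sample-new.txt`. Locations containing regex metacharacters or a `|` would need escaping first, which this sketch does not attempt.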

Upvotes: 2

ghoti

Reputation: 46896

Just to complete your set of options, you can do this in pure bash, slowly:

#!/usr/bin/env bash

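# Load the locations, one per array element (here t2 is the locations list)
readarray -t places < t2

# IFS= and -r keep whitespace and backslashes intact (here t1 is the sample data)
while IFS= read -r line; do
  for place in "${places[@]}"; do
    # Replace the first space-delimited occurrence; quoting makes the match literal
    line="${line/ "$place" / ;$place; }"
  done
  echo "$line"
done < t1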

Note that this likely won't work as expected if one place name contains another, for example "niagara on the lake", which contains "on":

foo bar co-op ;niagara ;on; the lake; on $1

Instead, you might want to do more targeted pattern matching, which will be much easier in awk:

#!/usr/bin/awk -f

# Collect the location list into the index of an array
NR==FNR {
  places[$0]
  next
}

# Now step through the input file
{

  # Handle two-letter provinces
  if ($(NF-1) in places) {
    $(NF-1)=";" $(NF-1) ";"
  }

  # Step through the remaining places doing substitutions as we find matches
  for (place in places) {
    if (length(place)>2 && index($0,place)) {
      sub(place,";"place";")
    }
  }

}

# Print every line
1

This works for me using the data in your question, plus an extra "niagara on the lake" line:

$ cat places
toronto
powell river
niagara on the lake
on
bc
$ ./tst places input
new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444
foo bar co-op ;niagara on the lake; ;on; $1

You may have a problem if your places file contains a two-letter name that is not a province. I'm not sure if such places exist in Canada, but if they do, you'll either need to tweak such lines manually, or make the script more complex by handling provinces separately from cities.
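
For illustration, handling provinces separately from cities might look roughly like the sketch below. It assumes the list is split into two files, one of provinces and one of cities; that split, the file names in the usage line, and the function name `mark_places` are hypothetical. Provinces are only matched in the second-to-last field, and cities are matched as literal, space-delimited strings rather than as regular expressions.

```shell
#!/bin/sh
# mark_places PROVINCES_FILE CITIES_FILE INPUT_FILE
# Hypothetical helper; the two-file split is an assumption, not part of the question.
mark_places() {
    awk -v provfile="$1" -v cityfile="$2" '
    BEGIN {
        # Load each list into the index of its own array
        while ((getline line < provfile) > 0) provinces[line]
        while ((getline line < cityfile) > 0) cities[line]
    }
    {
        # Province: only the second-to-last field is considered
        if (NF > 1 && ($(NF-1) in provinces))
            $(NF-1) = ";" $(NF-1) ";"

        # Cities: literal match of the first space-delimited occurrence
        for (city in cities) {
            pos = index(" " $0 " ", " " city " ")
            if (pos)
                $0 = substr($0, 1, pos - 1) ";" city ";" substr($0, pos + length(city))
        }
        print
    }' "$3"
}
```

It could be called as `mark_places provinces.txt cities.txt input`. City names nested inside other city names would still depend on awk's unspecified array traversal order, so those would need extra handling.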

Upvotes: 1
