Rczone

Reputation: 501

Want to sort a file based on another file in unix shell

I have two files, refer.txt and parse.txt.

refer.txt contains the following

julie,remo,rob,whitney,james

parse.txt contains

remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0

Now output.txt should list the entries from parse.txt in the order specified in refer.txt.

An example of what output.txt should be:

julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0

I have tried the following command:

sort -nru refer.txt parse.txt

but no luck.

Please assist me. TIA.

Upvotes: 1

Views: 1001

Answers (4)

seane

Reputation: 599

Command

while read line; do
  grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -

Key points

  • For both files, commas are translated to newlines using the tr command (without actually changing the files themselves); see the short illustration after this list. This is useful because while read and grep work under the assumption that your records are separated by newlines rather than commas.
  • while read will read in every name from refer.txt (i.e. julie, remo, etc.) and then use grep to retrieve the lines from parse.txt containing that name.
  • The ^ in the regex ensures matching is only performed from the start of the string and not in the middle (thanks to @CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
  • The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.
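
For illustration, this is what the intermediate newline-separated form looks like for the sample files in the question, and how grep picks out one name from it (the two steps above, run separately):

tr , "\n" < parse.txt
# remo/hello/1.0
# remo/hello2/2.0
# remo/hello3/3.0
# whitney/hello/1.0
# julie/hello/2.0
# julie/hello/3.0
# rob/hello/4.0
# james/hello/6.0

grep -w "^julie" <(tr , "\n" < parse.txt)
# julie/hello/2.0
# julie/hello/3.0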

Upvotes: 0

user5321531

Reputation: 3265

tr , "\n" < refer.txt | cat -n > person_id.txt   # 'cat -n' is not POSIX; use sed and paste if needed

cat person_id.txt | while read person_id person_key
do 
    printf '%s\n' "$person_id" > "$person_key"
done

tr , "\n" < parse.txt | sed 's/^\([^\/]*\)\(\/.*\)$/\1 \1\2/' > person_data.txt

cat person_data.txt | while read foreign_key person_data
do 
    person_id="$(<"$foreign_key")"
    printf '%s %s\n' "$person_id" "$person_data" >> merge.txt
done

sort -n merge.txt > output.txt

A textbook data-processing approach: a person id table and a person data table, merged on a common key field, which is the first name of the person:

[person_id] [person_key]
- the person id table: a unique, sortable 'id' for each person (the line number in this instance, since that is the desired sort order) and a key for each person (their first name)

[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'

[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested

The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
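
As a minimal sketch of that idea (the key name and value here are made up purely to show the mechanism):

mkdir -p ./kv
printf '%s\n' "some value" > ./kv/some_key   # set:  kv[some_key]="some value"
value="$(cat ./kv/some_key)"                 # get:  $value is now "some value"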

This actually adds a step to the otherwise simple but not very efficient approach of grepping parse.txt with each value in refer.txt; which of the two is more efficient, I'm not sure.

NB: The above code is very unlikely to work out of the box.

NBB: On reflection, a better way of doing this would probably be to use the file system to create a random-access index of parse.txt, and then to treat refer.txt as a batch job: for each name read from refer.txt in turn, print out the data for that name from the parse.txt index:

# 1) index data file on required field
# (person_data.txt here is assumed to be parse.txt with commas translated to newlines)
mkdir -p ./person_data
cat person_data.txt | while read data
do
    key="$(printf '%s\n' "$data" | sed 's/\/.*$//')"  # alt. `cut -d'/' -f1`
    printf '%s\n' "$data" >> ./person_data/"$key"
done

# 2) run batch job
# (refer_data.txt is assumed to be refer.txt with commas translated to newlines)
cat refer_data.txt | while read key
do
    cat ./person_data/"$key"
done

However, having said that, using egrep is probably just as rigorous a solution, at least for small datasets; given the specific question posed, I would most certainly use that approach. (Or maybe not! The above could well prove faster as well as being more robust.)
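
For reference, one possible reading of that egrep-based alternative (a sketch under the same assumptions as above, using the newline-separated working files already described):

tr , '\n' < parse.txt > parse_lines.txt
tr , '\n' < refer.txt | while read -r name
do
    egrep "^$name/" parse_lines.txt
done | paste -s -d , - > output.txt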

Upvotes: 0

Charles Duffy

Reputation: 295433

In pure native bash (4.x):

# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt

# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[@]}"; do
  key=${value%%/*}
  if [[ ${kv[$key]} ]]; then
    kv[$key]+=",$value" # already exists, comma-separate
  else
    kv[$key]="$value"
  fi
done

# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[@]}"; do
  out+=( "${kv[$value]}" )
done

# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt

If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
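
If you're unsure which bash you have, a quick check is:

bash -c 'echo "$BASH_VERSION"'   # needs to report 4.0 or newer for associative arrays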

Upvotes: 1

anubhava

Reputation: 785156

You can do that using gnu-awk:

awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
              {s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt

Output:

julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0

Explanation:

-F/                          # Use field separator as /
-v RS=',|\n'                 # Use record separator as comma or newline
NR == FNR {                  # While processing parse.txt
a[$1]=(a[$1])?a[$1] ","$0:$0 # create an array with 1st field as key and value as all the 
                             # records with keys julie, remo, rob etc.
}
{                            # while processing the second file refer.txt
  s = (s)?s "," a[$1]:a[$1]  # aggregate all values by reading key from 2nd file
}
END {print s }               # print all the values
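
Note that using a regular expression as the record separator (RS=',|\n') is a GNU awk extension, which is why gawk is required here. To get the result into output.txt as asked in the question, simply redirect the command's output:

awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
              {s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt > output.txt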

Upvotes: 2
