Reputation: 131
I need to match one field from two files and then combine fields from both files.
File 1 will be huge and could have close to 12 million lines. File 2 will only have about 130 lines.
I tried -
egrep -f file1 file2
This only gives me the matching information that's containted in file2.
Is there a way to match information from both files and then combine portions from both files?
Also, I need something that's efficient since file1 is quite large.
All fields are pipe delimited.
Below is the contents of both files as well as the output file. I've included a key to each file.
Any guidance would be greatly appreciated.
File 1:
1234567890|1|6664|NA|KY|0|0|18
2345678901|22|1958|NA|MT|0|0|77
3456789012|3|991B|NA|EM|0|0|13
4567890123|3223|154E|NA|NY|0|0|52
5678901234|8|008F|NA|AZ|0|0|04
6789012345|27|D983|NA|IT|0|0|0
File 1 Key
Field 1: 10 digit number
Field 2: variable length numeric characters
Field 3: 4 alpha-numeric characters
Field 4: 2 alpha characters
Field 5: 2 alpha characters
Field 6: 1 digit number
Field 7: 1 digit number
Field 8: variable length numeric characters
File 2:
6664|Dr. Zhivago|Russian Literature|421
1958|Dr. Spock|Child Psychology|308
991B|Dr. Martin|Shoe Design|406
3223|Dr. House|Diagnostic Skills|440
008F|Dr. Evil|Global Political Economy|326
D983|Dr. Quincy|Processes of Science: Forensics|156
File 2 Key
Field 1: 4 alpha-numeric characters
Field 2: Variable length alpha-numeric characters
Field 3: Variable length alpha-numeric characters
Field 4: 3 digit number
Required Output:
1234567890|1|6664|Russian Literature
2345678901|22|1958|Child Psychology
3456789012|3|991B|Shoe Design
4567890123|3223|154E|Diagnostic Skills
5678901234|8|008F Global Political Economy
6789012345|27|D983|Processes of Science: Forensics
Output Key
From File 1 –
Field 1: 10 digit number
Field 2: variable length numeric characters
Field 3: 4 alpha-numeric characters
From File 2 –
Field 4: Variable length alpha-numeric characters
Upvotes: 0
Views: 621
Reputation: 56059
Being relatively small, the relevant contents of file2 will fit into memory. So slurp it up into an array, keyed on the id number, and for each line in file1, check if it's in the array and if so, print the output in the desired format.
awk 'BEGIN{FS=OFS="|"}NR==FNR{a[$1]=$3;next}$3 in a{print $1,$2,$3,a[$3]}' file2 file1
Note that file2 must come first on the command line.
Upvotes: 1