user3486154
user3486154

Reputation: 131

grep, awk or sed to Match and Combine Two Files

I need to match one field from two files and then combine fields from both files.

File 1 will be huge and could have close to 12 million lines. File 2 will only have about 130 lines.

I tried -

egrep -f file1 file2

This only gives me the matching information that's containted in file2.

Is there a way to match information from both files and then combine portions from both files?

Also, I need something that's efficient since file1 is quite large.

All fields are pipe delimited.

Below is the contents of both files as well as the output file. I've included a key to each file.

Any guidance would be greatly appreciated.

File 1:

1234567890|1|6664|NA|KY|0|0|18
2345678901|22|1958|NA|MT|0|0|77
3456789012|3|991B|NA|EM|0|0|13
4567890123|3223|154E|NA|NY|0|0|52
5678901234|8|008F|NA|AZ|0|0|04
6789012345|27|D983|NA|IT|0|0|0

File 1 Key

Field 1: 10 digit number
Field 2: variable length numeric characters
Field 3: 4 alpha-numeric characters
Field 4: 2 alpha characters
Field 5: 2 alpha characters
Field 6: 1 digit number
Field 7: 1 digit number
Field 8: variable length numeric characters

File 2:

6664|Dr. Zhivago|Russian Literature|421
1958|Dr. Spock|Child Psychology|308
991B|Dr. Martin|Shoe Design|406
3223|Dr. House|Diagnostic Skills|440
008F|Dr. Evil|Global Political Economy|326
D983|Dr. Quincy|Processes of Science: Forensics|156

File 2 Key

Field 1: 4 alpha-numeric characters
Field 2: Variable length alpha-numeric characters
Field 3: Variable length alpha-numeric characters
Field 4: 3 digit number

Required Output:

1234567890|1|6664|Russian Literature
2345678901|22|1958|Child Psychology
3456789012|3|991B|Shoe Design
4567890123|3223|154E|Diagnostic Skills
5678901234|8|008F Global Political Economy
6789012345|27|D983|Processes of Science: Forensics

Output Key

From File 1 –

Field 1: 10 digit number
Field 2: variable length numeric characters
Field 3: 4 alpha-numeric characters

From File 2 –

Field 4: Variable length alpha-numeric characters

Upvotes: 0

Views: 621

Answers (1)

Kevin
Kevin

Reputation: 56059

Being relatively small, the relevant contents of file2 will fit into memory. So slurp it up into an array, keyed on the id number, and for each line in file1, check if it's in the array and if so, print the output in the desired format.

awk 'BEGIN{FS=OFS="|"}NR==FNR{a[$1]=$3;next}$3 in a{print $1,$2,$3,a[$3]}' file2 file1

Note that file2 must come first on the command line.

Upvotes: 1

Related Questions