Ari

Reputation: 885

Validate that a text file contains the expected fields for each data set

If one has a document in this format:

Data point 1:
    field 1:
    field 2:
    field 3:

Data point 2:
    field 1:
    field 2:
    field 3:

Data point 3:
etc...

You could verify that each field exists for each data point manually by scrolling through thousands of lines in a file, but that would be inefficient and time-consuming.

I have thought about splitting the file and comparing each section using diff, but that would be prone to issues if there is a difference in line count or formatting.

So how would you process a file and verify that each data point has the right number of fields and the expected field names?

Upvotes: 0

Views: 59

Answers (2)

Fred

Reputation: 7015

Create a bash script starting with:

#!/bin/bash

Inside that script, create a function that reads from standard input and checks for each field in a single "record", like so:

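check_record()
{
   local LINE
   # Read one line per expected field and test it against a pattern.
   # The space inside the regex must be escaped, otherwise [[ ]] reports
   # a syntax error.
   IFS= read -r LINE
   [[ "$LINE" =~ ^[[:space:]]*field\ 1: ]] || return 1
   IFS= read -r LINE
   [[ "$LINE" =~ ^[[:space:]]*field\ 2: ]] || return 1
   ...
}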

The function returns 0 (true) if the record is OK, and 1 otherwise.

Then create a function that searches for the line indicating the start of a record:

find_records()
{
   local LINE
   while IFS= read -r LINE
   do
     [[ "$LINE" =~ ^Data ]] || continue
     check_record || echo "Bad record: $LINE"
   done
}

Finally, add a line (at the end, outside both functions) that redirects the file passed as the first argument into that function's standard input.

find_records <"$1"

You may want to add error checking, and the details of what you allow or not (e.g. empty lines) in your data file could vary, but that should convey the basic idea.

Please note that this uses bash-specific [[ ]] conditionals and =~ regular expression matching. Please ask if you need explanations.
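As a rough illustration of how [[ ]] and =~ behave, here is a minimal sketch. The variable name re and the sample lines are made up for this example and are not part of the script above. Storing the pattern in a variable is one common way to avoid having to escape spaces inside the regular expression.

```shell
#!/bin/bash
# Hypothetical pattern and sample lines, for illustration only.
# Keeping the regex in a variable avoids escaping the space by hand.
re='^[[:space:]]*field 1:'

matches=0
for line in '    field 1: abc' 'field 1:' '    field 2: abc'; do
    # [[ ]] performs no word splitting on $line; the unquoted $re on the
    # right-hand side of =~ is interpreted as a regular expression.
    if [[ $line =~ $re ]]; then
        matches=$((matches + 1))
    fi
done
echo "$matches of 3 lines matched"
```

Quoting the right-hand side, as in "$re", would make bash compare it as a literal string rather than as a regular expression, so it is left unquoted here.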

Upvotes: 1

user8017719

Reputation:

Awk can split a file into records on empty lines if RS (the record separator) is set to the empty string, something like this:

awk -v RS='' '…'

Awk can then split each record into fields, one per line, by setting FS to a newline. Each valid record has four such fields (the "Data point" line plus three field lines), so counting the fields per record is easy to implement:

awk -v RS='' -v FS='\n' '(NF!=4){print $1}' "infile"

More complex checks, such as verifying the field names themselves, have to be implemented separately.
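One possible way to extend the field count to a check of the field names is sketched below. It assumes the layout from the question (a "Data point" line followed by "field 1:" to "field 3:", in that order); the function name check_fields and the sample file are hypothetical and exist only for this example.

```shell
# Hypothetical sample data: the second record is missing "field 3:".
sample=$(mktemp)
cat >"$sample" <<'EOF'
Data point 1:
    field 1: a
    field 2: b
    field 3: c

Data point 2:
    field 1: a
    field 2: b
EOF

# RS='' puts awk in paragraph mode: each blank-line-separated block is one
# record. FS='\n' makes every line of the block a separate field.
check_fields() {
    awk -v RS='' -v FS='\n' '
    {
        bad = (NF != 4)
        # Lines 2..4 must be "field 1:", "field 2:", "field 3:", in order.
        for (i = 2; i <= 4 && !bad; i++)
            if ($i !~ ("^[ \t]*field " (i - 1) ":"))
                bad = 1
        if (bad) print "Bad record: " $1
    }' "$1"
}

result=$(check_fields "$sample")
echo "$result"
rm -f "$sample"
```

The sketch reports the first line of every record that has the wrong number of lines or a field name out of place. If the field order is allowed to vary, the loop would need to be adapted.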

Upvotes: 0
