Reputation: 173

Filter block of text based on a single value in the block of text

I have an LDIF file with about 23K user objects separated by blank lines. Each user object (a block of text in the file) has a workforceid value, and I want to remove every user object (the entire block of text) whose workforceid is 5 characters long. The user objects come from two different companies: one company has 5-digit IDs and the other has 8-digit IDs, and I need to process the user objects with 8-digit IDs. Data set example:

# zhayangy, Company
dn: cn=zhayangy,o=Company
workforceid: 26000180
street: 699 axian Road
st: Shanghai
preferredname: Zhao, Yangyang
physicaldeliveryofficename: ABC01:
ou: IT Engineering
mail: [email protected]
givenname: Yangyang
fullname: Yangyang Zhao
employeetype: Cont
employeestatus: Active
costcenter: ABCD501641
companycategory: abc.com
co: China
city: Shanghai
uid: zhayangy
sn: Zhao
cn: zhayangy
objectclass: inetOrgPerson
objectclass: ApplicationAttrs
objectclass: organizationalPerson
objectclass: Person
objectclass: LoginProperties
objectclass: Top
objectclass: PasswordUser
objectclass: UserAux
objectclass: FolderUser
objectclass: eSystem
objectclass: pwUser
objectclass: AuthAttrs

# mikhaylo, Company
dn: cn=mikhaylo,o=Company
workforceid: 76000838
street: Gradskoe shoe, 11A block 1
preferredname: Mikhaylov, Vladislav
postalcode: 12345
physicaldeliveryofficename: ABW02:
ou: Presales ABCE
mail: [email protected]
givenname: Vladislav
fullname: Vladislav Mikhaylov
employeetype: Employee
employeestatus: Active
costcenter: ABCA500189
companycategory: abc.com
co: Russian Federation
city: Moscow
uid: mikhaylo
sn: Mikhaylov
cn: mikhaylo
objectclass: inetOrgPerson
objectclass: ApplicationAttrs
objectclass: organizationalPerson
objectclass: Person
objectclass: LoginProperties
objectclass: Top
objectclass: PasswordUser
objectclass: UserAux
objectclass: FolderUser
objectclass: eSystem
objectclass: pwUser
objectclass: AuthAttrs

The command below returns all the records that have a workforceid, but I think it only works if the workforceid is the second entry. It would be nice to have a command that finds the workforceid and counts the length of its value regardless of where it falls in the object.

Basically I need to somehow add a length check such as if(length($2) == 5), but $2 is the second row in the block of text and not the second column in the workforceid row (or column, depending on how you look at it).

awk -v RS='' '/workforceid/ {if ( length($7) == 5 ) print $0}' ORS='\n\n' fullextract.ldif

Thanks in advance

Upvotes: 1

Answers (3)

Ed Morton

Reputation: 203655

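Which field holds the ID depends on what precedes it in the record. In paragraph mode (RS empty) awk splits fields on any whitespace, newlines included, so the ID is $4 when the record starts at the dn: line and $7 when the "# name, Company" comment line shown in your sample is really in the file; either way it is not $2. For the $4 case all you need is: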

awk -v RS= -v ORS='\n\n' 'length($4) == 8' fullextract.ldif

You could've just printed the fields to see that.
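
A minimal sketch of that check, assuming a record shaped like the first one in the question, comment line included (the temporary sample file and its trimmed-down contents are made up for illustration). Because paragraph mode splits fields on any whitespace, the words of the comment line count toward the field numbers:

```shell
# Hypothetical sample: one record with a leading comment line, as in the question
sample=$(mktemp)
cat > "$sample" <<'EOF'
# zhayangy, Company
dn: cn=zhayangy,o=Company
workforceid: 26000180
st: Shanghai
EOF

# Print every field of the first record together with its field number, then stop
fields=$(awk -v RS= 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$sample")
printf '%s\n' "$fields"
rm -f "$sample"
```

With this sample the ID is listed as field 7; without the comment line it would be field 4, which is why the index has to be checked against the real file.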

If it can be anywhere:

awk -v RS= -v ORS='\n\n' '/(^|\n)workforceid: [0-9]{8}(\n|$)/' fullextract.ldif
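
If you'd rather test the length explicitly than encode it in a regexp, one option is to scan the fields for the workforceid: tag and check the field that follows it. This is only a sketch: the sample data and the variable name want are made up for illustration, and it assumes the ID value contains no spaces and that the string "workforceid:" never appears as a word inside some other attribute's value.

```shell
# Hypothetical sample: two records, the ID at different positions, lengths 8 and 5
sample=$(mktemp)
cat > "$sample" <<'EOF'
dn: cn=alice,o=Company
workforceid: 26000180
st: Shanghai

dn: cn=bob,o=Company
st: Moscow
workforceid: 12345
EOF

# Keep a record if the field right after "workforceid:" has the wanted length
kept=$(awk -v RS= -v ORS='\n\n' -v want=8 '
{
    for (i = 1; i < NF; i++)
        if ($i == "workforceid:" && length($(i+1)) == want) { print; next }
}' "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

Changing want to 5 would select the other company's records instead.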

The more robust, general way to approach the problem of data with tag: value pairs is to create an array that stores them and then operate on the array, e.g.:

awk '
NF {
    rec = rec $0 ORS
    tag = val = $0
    sub(/:.*/,"",tag)
    sub(/[^:]+: /,"",val)
    tag2val[tag] = val
    next
}
{ prt(); rec=""; delete tag2val }
END { prt() }
function prt() {
    if ( length(tag2val["workforceid"]) == 8 ) {
        print rec
    }
}
' file

With that it's trivial to add additional tests on other fields, only print specific fields, etc. With your particular data you'd have to deal with the "objectclass" fields all having the same tag if you wanted to test or print them individually, but that's easily dealt with however you want it handled, e.g. add a counter to uniquely identify each in tag2val[], or use a separate array just for them, maybe indexed by their values so you can easily use "in" to test for their presence.
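
As one possible sketch of the "separate array indexed by their values" idea, the variant below stores every objectclass value as an index of its own array so that "in" can test for it. The array name classes, the sample data, and the choice of PasswordUser as the class to test for are all illustrative assumptions.

```shell
# Hypothetical sample: both IDs have 8 digits, only alice has the PasswordUser class
sample=$(mktemp)
cat > "$sample" <<'EOF'
dn: cn=alice,o=Company
workforceid: 26000180
objectclass: Person
objectclass: PasswordUser

dn: cn=bob,o=Company
workforceid: 76000838
objectclass: Person
EOF

out=$(awk '
NF {
    rec = rec $0 ORS
    tag = val = $0
    sub(/:.*/,"",tag)
    sub(/[^:]+: /,"",val)
    if ( tag == "objectclass" ) { classes[val] = 1 }   # repeated tag: index by value
    else                        { tag2val[tag] = val } # unique tag: store the value
    next
}
{ prt(); rec=""; delete tag2val; delete classes }      # blank line ends a record
END { prt() }
function prt() {
    # print only records with an 8-character ID that also carry the wanted class
    if ( length(tag2val["workforceid"]) == 8 && ("PasswordUser" in classes) ) {
        print rec
    }
}
' "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

Only the first record is printed here: the second one has an 8-digit ID but lacks the PasswordUser class.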

Upvotes: 1

Michael Vehrs

Reputation: 3363

I'm surprised that works. $7 does not seem to be the work force id. Anyway, here's my solution:

awk -v RS='' -v ORS='\n\n' '/workforceid: [0-9]{8}/' ldif

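In other words, if the work force id starts with eight digits, print the record, otherwise don't. The pattern isn't anchored at the end, so a longer id would match too, but with only 5- and 8-digit ids in the file it selects exactly the 8-digit ones.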

Upvotes: 0

H-man

Reputation: 173

I think I got the answer here after testing. Please let me know if I am wrong. I'm not confident that it is correct, but I moved the "workforceid" to a different location in the object and it gives me the same count. So I think I got it.

awk -v RS='' '/workforceid/ {if ( length($7) == 5 ) print $0}' ORS='\n\n' fullextract.ldif

Upvotes: 0
