user2979872

Reputation: 467

awk finding a column and trimming

I have a text file with an irregular structure, like the following:

first_name1 last_name1 designation1 email1 phone_number1
first_name2 last_name2 designation2 email2
first_name3 last_name3 designation3 email3 phone_number3 address3

As you can see, the email could be the last, second-to-last, or third-to-last column, so one cannot simply use $NF to get it. My goal is to find the email address wherever it is on the line and then extract the portion before the @. For instance, if email1 = [email protected], I want to extract foobar. How can I write an awk command to extract the first portion of the email address? I tried the following, but it only looks for an exact match. How can I turn it into a regex to get the job done?

awk '{for(i=1;i<=NF;i++){ if($i=="[email protected]"){print $i} } }' users.txt 

Upvotes: 0

Views: 70

Answers (2)

e0k

Reputation: 7161

You are comparing $i to a string "[email protected]", so yes of course this will only make an exact comparison. What it seems you are looking for is whether or not $i matches (~) a regular expression (/.../ instead of "..."), then tailor the regex to your needs. Try something like:

awk '{for(i=1;i<=NF;++i){if ($i ~ /.+@.+/){sub(/@.*$/, "", $i); print $i; next}}}' users.txt

The regex /.+@.+/ matches a string with a @ in it, and some non-empty thing before it and after it. This will not match, for example @foobar or foobar@, or just @. You might want to consider using something more like /.+@.+\..+/ which would match (something)@(something).(something) since domain names usually have a . in them. You can tailor this regex to be more specific, if you wish.
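As a rough illustration of the stricter pattern, the sketch below prints only the fields that match /.+@.+\..+/. The function name looks_like_email and the sample addresses are made up for this example and are not from the question.

```shell
# Hypothetical helper: print every field shaped like (something)@(something).(something).
looks_like_email() {
    awk '{ for (i = 1; i <= NF; ++i) if ($i ~ /.+@.+\..+/) print $i }'
}

# Made-up sample fields: only the first and the last have a dot after the @.
printf '%s\n' 'alice@example.com bob@localhost @foo bar@ carol@mail.example.org' | looks_like_email
```

With this input the helper prints alice@example.com and carol@mail.example.org; bob@localhost is skipped because nothing after the @ contains a dot.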

The sub(/@.*$/, "", $i) means to substitute in $i everything from the first @ (inclusive) to the end of the field ($) with an empty string "", thus stripping out everything after the username and leaving only the part before the @. The print $i prints it, and the next moves on to the next line (skipping any remaining fields for the current record).
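To try the approach on data shaped like the question's file, here is a small sketch. The function name local_parts and the sample names, addresses, and phone numbers are invented for illustration only.

```shell
# Hypothetical wrapper around the one-liner; reads stdin or any files passed as arguments.
local_parts() {
    awk '{
        for (i = 1; i <= NF; ++i) {
            if ($i ~ /.+@.+/) {
                sub(/@.*$/, "", $i)   # drop the @ and everything after it
                print $i
                next                   # done with this line, skip remaining fields
            }
        }
    }' "$@"
}

# Made-up records: the email is the second-to-last, last, and third-to-last field.
local_parts <<'EOF'
John Smith manager jsmith@example.com 555-0100
Jane Doe engineer jdoe@example.org
Sam Lee analyst slee@example.net 555-0199 Springfield
EOF
```

This prints jsmith, jdoe, and slee, one per line, regardless of which column holds the address.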

Upvotes: 2

Nicolas

Reputation: 7081

I don't know awk at all, but I looked up the regex reference and this should be supported: \b([^ ]*@.*?)($|[^\w@.]) in which group 1 matches the email. This just searches for something after a word boundary that contains @. The match ends at the next non-word character, excluding @ and the dot.
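One caveat: awk uses POSIX extended regular expressions, which have no lazy quantifiers such as .*?, and in awk \b is a backspace rather than a word boundary (gawk uses \y for that). Standard awk also has no built-in way to read back a capture group. A minimal sketch of the same idea using match() with RSTART and RLENGTH instead, assuming fields are separated by spaces or tabs and the address itself contains no whitespace (the function name extract_local_part is made up here):

```shell
# Hypothetical helper: print the part before the @ of the first email-like token per line.
extract_local_part() {
    awk '{
        # Leftmost run of non-blank characters that has an @ inside it.
        if (match($0, /[^ \t@]+@[^ \t]+/)) {
            addr = substr($0, RSTART, RLENGTH)           # the whole matched address
            print substr(addr, 1, index(addr, "@") - 1)  # keep only what precedes the @
        }
    }' "$@"
}

printf '%s\n' 'John Smith manager jsmith@example.com 555-0100' | extract_local_part
```

For the made-up line above this prints jsmith; lines without an @ produce no output.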

Upvotes: 0
