patrick

Reputation: 97

How to grep for specific parts of a file?

I need to extract some specific parts from a very big (> 3 GB) text file.

,(1,'[email protected]',0,0,1,1,0,0,1),
 (2,'[email protected]',1,0,3,1,7,0,1),
 (3,'[email protected]',0,0,0,1,0,0,1),
 (4,'[email protected]',1,0,7,1,1,1,3),
 (5,'[email protected]',0,0,3,1,1,0,1),
 (6,'[email protected]',1,0,5,1,6,1,1),

I need the first field, the email, and the third field (without the quotes), one record per line, as below:

1,[email protected],0

2,[email protected],1

3,[email protected],0

etc..

If possible, I would also like to extract the domain name (like 1,[email protected],hotmail.com,0).

I can extract the emails with the following:

grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b' test

I also tried a lot more, like egrep -o -E '([^),(^]+)' test, and sed.

I hope someone can help me out!

Upvotes: 0

Views: 55

Answers (1)

luoluo

Reputation: 5533

First, use tr to turn each ) into a newline, which splits the very long line into one record per line, and to turn ( and ' into spaces.

Then use tr -d to delete those spaces.

Finally, use awk to print the expected columns.

tr ")('" "\n  " < file | tr -d " " | awk -F"," '{print $2","$3","$4}'
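To make the three stages easier to try, here is a minimal sketch that wraps them in a shell function and runs them on a made-up record. The function name extract_fields and the example.com / example.org addresses are placeholders, not taken from the question, and the NF > 3 guard is an addition that skips the leftover "," line. It assumes all records sit on one long line, so every record starts with a comma.

```shell
# Hypothetical helper: reads the dump on stdin, writes "id,email,flag" lines.
extract_fields() {
    # 1) ")" -> newline, "(" and "'" -> space   2) delete the spaces
    # 3) keep fields 2-4 (field 1 is empty because each record starts with ",")
    tr ")('" "\n  " | tr -d " " | awk -F"," 'NF > 3 { print $2 "," $3 "," $4 }'
}

# Made-up sample input with placeholder addresses.
printf "%s\n" ",(1,'alice@example.com',0,0,1,1,0,0,1),(2,'bob@example.org',1,0,3,1,7,0,1)," |
    extract_fields
```

On the real file this would be called as extract_fields < file.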


UPDATE

To get the domain as well, split the email address on "@" and print the second part.

tr ")" "\n" < file | tr -d "[ (']" |awk -F"," '{ split($3, a, "@"); print $2","$3","a[2]","$4;}'


FINAL UPDATE

Add a check so that only well-formed lines (more than two fields) are printed.

tr ")" "\n" < file | tr -d "[ (']" |awk -F"," '{ split($3, a, "@"); if (NF>2) {print $2","$3","a[2]","$4;}}'

OUTPUT

1,[email protected],hotmail.com,0
2,[email protected],hotmail.com,1
3,[email protected],live.com,0
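If the records are separated by newlines, as in the listing in the question, the comma between two records ends up on a line of its own after tr, and the field positions of the following record shift by one. As a possible alternative (a sketch, not the pipeline above), a single awk call can use ) as the record separator and strip the unwanted characters itself. The function name extract_with_domain and the sample addresses are placeholders, not taken from the question.

```shell
# Hypothetical helper: reads the dump on stdin, writes "id,email,domain,flag".
extract_with_domain() {
    awk 'BEGIN { RS = ")"; FS = ","; OFS = "," }
    {
        # Strip "(", single quotes (octal 047) and all whitespace, newlines included.
        gsub(/[(\047 \t\r\n]/, "")
        # Changing $0 re-splits the fields; skip leftovers such as a lone ",".
        if (NF > 3) {
            split($3, parts, "@")
            print $2, $3, parts[2], $4
        }
    }'
}

# Made-up sample input, one record per line, with placeholder addresses.
printf "%s\n" ",(1,'alice@example.com',0,0,1,1,0,0,1)," " (2,'bob@example.org',1,0,3,1,7,0,1)," |
    extract_with_domain
```

Because whitespace is removed inside each record, the same function should give the same result when all records are on one long line.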

Upvotes: 1
