Reputation: 97
I need to extract some specific parts from a very big (> 3 GB) text file.
,(1,'[email protected]',0,0,1,1,0,0,1), (2,'[email protected]',1,0,3,1,7,0,1), (3,'[email protected]',0,0,0,1,0,0,1), (4,'[email protected]',1,0,7,1,1,1,3), (5,'[email protected]',0,0,3,1,1,0,1), (6,'[email protected]',1,0,5,1,6,1,1),
And I need the first field, the email, and the third field (without the quotes), one record per line, as below:
1,[email protected],0
2,[email protected],1
3,[email protected],0
etc..
And if possible I also want to extract the domain name (like 1,[email protected],hotmail.com,0 ).
I can extract the emails with the following:
grep -o -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b' test
and I tried a lot more...
like egrep -o -E '([^),(^]+)' test
, and sed, but with no luck.
I hope someone can help me out!
Upvotes: 0
Views: 55
Reputation: 5533
You may use tr to split the very long line into multiple lines. Then use tr again to remove the special characters like ( and ). Finally, use awk to print the expected columns.
tr ")('" "\n " < file | tr -d "[ ]" |awk -F"," '{print $2","$3","$4}'
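If the two tr passes feel heavy for a 3 GB file, the same extraction can be sketched with one tr plus one awk by making ) the record separator. The printf line below is small made-up sample data standing in for the real file:

```shell
# Sketch: delete spaces, parens and quotes, then let awk split records on ")".
# The printf data is a stand-in for the real 3 GB file.
printf ",(1,'a@hotmail.com',0,0,1), (2,'b@hotmail.com',1,0,3)" |
  tr -d " ('" |
  awk -v RS=")" -F"," 'NF > 2 { print $2 "," $3 "," $4 }'
# prints:
# 1,a@hotmail.com,0
# 2,b@hotmail.com,1
```

One process fewer in the pipeline; otherwise the behavior is the same as the command above.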
UPDATE
Then splitting the email on @ to get the hostname solves your problem:
tr ")" "\n" < file | tr -d "[ (']" |awk -F"," '{ split($3, a, "@"); print $2","$3","a[2]","$4;}'
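The split() call can be seen in isolation on a single already-cleaned record (user@example.com is made-up data): split($3, a, "@") puts the part before the @ into a[1] and the domain into a[2].

```shell
# Isolated demo of awk's split(): a[2] receives the domain part of $3.
# The leading empty field mimics a cleaned record from the pipeline above.
printf ",1,user@example.com,0\n" |
  awk -F"," '{ split($3, a, "@"); print $2 "," $3 "," a[2] "," $4 }'
# prints: 1,user@example.com,example.com,0
```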
FINAL UPDATE
Add a check so that only the legal (complete) lines are printed:
tr ")" "\n" < file | tr -d "[ (']" |awk -F"," '{ split($3, a, "@"); if (NF>2) {print $2","$3","a[2]","$4;}}'
OUTPUT
1,[email protected],hotmail.com,0
2,[email protected],hotmail.com,1
3,[email protected],live.com,0
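The NF>2 guard can be exercised with a made-up sample that contains a junk record: the junk survives the tr passes as a short line and is then dropped by the check.

```shell
# "junk" is a fabricated malformed record; only the complete record prints.
printf "junk), (1,'a@hotmail.com',0,0,1)" |
  tr ")" "\n" |
  tr -d "[ (']" |
  awk -F"," '{ split($3, a, "@"); if (NF > 2) { print $2 "," $3 "," a[2] "," $4 } }'
# prints: 1,a@hotmail.com,hotmail.com,0
```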
Upvotes: 1