Reputation: 15660
How can I extract the lines with the Content-Type info? In some mails, these headers can span 2, 3, or even 4 lines, depending on how the mail was sent. This is one example:
Content-Type: text/plain;
 charset="us-ascii"
Content-Transfer-Encoding: 7bit
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
I tried this regex: ^(Content-.*:(.|\n)*)*
but it grabs everything.
How should I phrase my regex in Java to get only this part:
Content-Type: text/plain;
 charset="us-ascii"
Content-Transfer-Encoding: 7bit
Upvotes: 2
Views: 5497
Reputation: 34435
This tested script works for me:
import java.util.regex.*;
public class TEST
{
    public static void main( String[] args )
    {
        String subjectString =
            "Content-Type: text/plain;\r\n" +
            " charset=\"us-ascii\"\r\n" +
            "Content-Transfer-Encoding: 7bit\r\n" +
            "\r\n" +
            "Lorem ipsum dolor sit amet, consectetur adipisicing elit,\r\n" +
            "sed do eiusmod tempor incididunt ut labore et dolore magna\r\n" +
            "aliqua. Ut enim ad minim veniam, quis nostrud exercitation\r\n" +
            "ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n" +
            "Duis aute irure dolor in reprehenderit in voluptate velit\r\n" +
            "esse cillum dolore eu fugiat nulla pariatur. Excepteur sint\r\n" +
            "occaecat cupidatat non proident, sunt in culpa qui officia\r\n" +
            "deserunt mollit anim id est laborum.\r\n";
        String resultString = null;
        Pattern regexPattern = Pattern.compile(
            "^Content-Type.*?(?=\\r?\\n\\s*\\n)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE |
            Pattern.UNICODE_CASE | Pattern.MULTILINE);
        Matcher regexMatcher = regexPattern.matcher(subjectString);
        if (regexMatcher.find()) {
            resultString = regexMatcher.group();
        }
        System.out.println(resultString);
    }
}
It works for text using either the valid \r\n line terminations or the Unix-style \n line terminations (invalid, but commonly seen in the wild).
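To check that claim yourself, here is a minimal sketch that runs the same pattern against both a \r\n and a \n version of a shortened message. The class name ContentHeaderBlock and the extract method are hypothetical names picked for illustration, and the sample body is cut down to one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class ContentHeaderBlock {
    // Same pattern as in the script above (minus UNICODE_CASE): lazily match
    // from a line starting with "Content-Type" up to, but not including,
    // the line break that precedes the first blank line.
    private static final Pattern HEADER_BLOCK = Pattern.compile(
            "^Content-Type.*?(?=\\r?\\n\\s*\\n)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Returns the matched header block, or null if no blank line follows it.
    static String extract(String message) {
        Matcher m = HEADER_BLOCK.matcher(message);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String crlf = "Content-Type: text/plain;\r\n"
                + " charset=\"us-ascii\"\r\n"
                + "Content-Transfer-Encoding: 7bit\r\n"
                + "\r\n"
                + "Lorem ipsum dolor sit amet\r\n";
        // Same message with Unix-style line terminations.
        String lf = crlf.replace("\r\n", "\n");

        System.out.println(extract(crlf));
        System.out.println(extract(lf));
    }
}
```

Both calls should print the three header lines without the body. Note that the match keeps whatever line terminators the input used between the header lines.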
Upvotes: 0
Reputation: 109264
Check out the relevant RFCs for the exact definition of headers. IIRC, in essence you need to treat a linebreak followed by one or more whitespace characters (space or tab) as a continuation of the same header line. I also believe that you should collapse the linebreak and whitespace into a single space (note: there might be more complex rules, so check the RFCs).
A new line starts the next header only if it begins with a non-whitespace character, and a linebreak immediately followed by another linebreak ends the header section and starts the body section.
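As a rough sketch of those rules without a regex: walk the lines, join continuation lines onto the previous header, and stop at the first empty line. The class name HeaderUnfolder and the method parseHeaders are hypothetical, and this is a simplification that assumes each header name appears only once. It is not a complete RFC-compliant parser.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; a simplified sketch, not a full RFC-compliant parser.
final class HeaderUnfolder {
    // Parses the header section of a raw message into name -> unfolded value.
    // Assumption: header names are unique; a repeated name overwrites the earlier one.
    static Map<String, String> parseHeaders(String message) {
        // Header names are case-insensitive, so use a case-insensitive map.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        String name = null;
        StringBuilder value = new StringBuilder();

        for (String line : message.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // empty line: the header section ends, the body starts
            }
            char first = line.charAt(0);
            if ((first == ' ' || first == '\t') && name != null) {
                // Continuation line: collapse linebreak + whitespace into one space.
                value.append(' ').append(line.trim());
                continue;
            }
            // A line starting with non-whitespace begins the next header.
            if (name != null) {
                headers.put(name, value.toString());
            }
            value.setLength(0);
            int colon = line.indexOf(':');
            if (colon < 0) {
                name = null; // malformed line without a colon: skip it
                continue;
            }
            name = line.substring(0, colon).trim();
            value.append(line.substring(colon + 1).trim());
        }
        if (name != null) {
            headers.put(name, value.toString());
        }
        return headers;
    }

    public static void main(String[] args) {
        String message = "Content-Type: text/plain;\r\n"
                + " charset=\"us-ascii\"\r\n"
                + "Content-Transfer-Encoding: 7bit\r\n"
                + "\r\n"
                + "Lorem ipsum dolor sit amet\r\n";
        Map<String, String> headers = parseHeaders(message);
        System.out.println(headers.get("Content-Type"));
        System.out.println(headers.get("Content-Transfer-Encoding"));
    }
}
```

For the sample message this should give text/plain; charset="us-ascii" for Content-Type and 7bit for Content-Transfer-Encoding.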
BTW: Why not just use JavaMail instead of reinventing the wheel?
Upvotes: 0
Reputation: 9664
You can try this regex:
Pattern regex = Pattern.compile("Content-Type.*?(?=^\\s*\n?\r?$)",
    Pattern.DOTALL | Pattern.MULTILINE);
Upvotes: 1
Reputation: 26940
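Pattern regex = Pattern.compile("^Content-Type(?:.|\\s)*?(?=\\r?\\n[ \\t]*\\r?\\n)");
This matches everything from a Content-Type at the very start of the input up to the first empty line, for both \r\n and \n line endings. Without Pattern.MULTILINE, ^ only matches at the start of the input, so add that flag if other headers come first.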
Upvotes: 2