Carven

Reputation: 15660

Regex to extract Content-Type

How can I extract the lines with the Content-Type info? In some mails, these headers can span 2, 3 or even 4 lines, depending on how the mail was sent. This is one example:

Content-Type: text/plain;
    charset="us-ascii"
Content-Transfer-Encoding: 7bit

Lorem ipsum dolor sit amet, consectetur adipisicing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna 
aliqua. Ut enim ad minim veniam, quis nostrud exercitation 
ullamco laboris nisi ut aliquip ex ea commodo consequat. 
Duis aute irure dolor in reprehenderit in voluptate velit 
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint 
occaecat cupidatat non proident, sunt in culpa qui officia 
deserunt mollit anim id est laborum.

I tried this regex: ^(Content-.*:(.|\n)*)* but it grabs everything.

How should I phrase my regex in Java to get only this part:

Content-Type: text/plain;
    charset="us-ascii"
Content-Transfer-Encoding: 7bit

Upvotes: 2

Views: 5497

Answers (5)

ridgerunner

Reputation: 34435

This tested script works for me:

import java.util.regex.*;
public class TEST
{
    public static void main( String[] args )
    {
        String subjectString =
            "Content-Type: text/plain;\r\n" +
            "    charset=\"us-ascii\"\r\n" +
            "Content-Transfer-Encoding: 7bit\r\n" +
            "\r\n" +
            "Lorem ipsum dolor sit amet, consectetur adipisicing elit,\r\n" +
            "sed do eiusmod tempor incididunt ut labore et dolore magna\r\n" +
            "aliqua. Ut enim ad minim veniam, quis nostrud exercitation\r\n" +
            "ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n" +
            "Duis aute irure dolor in reprehenderit in voluptate velit\r\n" +
            "esse cillum dolore eu fugiat nulla pariatur. Excepteur sint\r\n" +
            "occaecat cupidatat non proident, sunt in culpa qui officia\r\n" +
            "deserunt mollit anim id est laborum.\r\n";
        String resultString = null;
        Pattern regexPattern = Pattern.compile(
            "^Content-Type.*?(?=\\r?\\n\\s*\\n)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE |
            Pattern.UNICODE_CASE | Pattern.MULTILINE);
        Matcher regexMatcher = regexPattern.matcher(subjectString);
        if (regexMatcher.find()) {
            resultString = regexMatcher.group();
        } 
        System.out.println(resultString);
    }
}

It works for text with either valid \r\n line terminators or Unix-style \n terminators (invalid in mail, but common in the wild).

Upvotes: 0

Mark Rotteveel

Reputation: 109264

Check out the relevant RFCs for the exact definition of headers. IIRC, in essence a line break followed by one or more whitespace characters (space or tab) continues the same header line. I also believe that you should collapse the line break and the whitespace into a single whitespace element (note: there might be more complex rules, so check the RFCs).

A new line is the next header only if it starts directly with a non-whitespace character. A line break that is immediately followed by another line break ends the header section and starts the body section.
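
The rule described above can be sketched in plain Java as follows. This is a minimal sketch, not a full RFC-compliant parser: the class name ContentHeaders and the method extract are made up for illustration, and joining continuation lines with a single space follows the description in this answer rather than the exact wording of the RFCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentHeaders {
    // One header field: a line starting with "Content-", plus any
    // continuation lines that begin with a space or tab.
    private static final Pattern CONTENT_HEADER = Pattern.compile(
        "^Content-[^:\\r\\n]*:[^\\r\\n]*(?:\\r?\\n[ \\t]+[^\\r\\n]*)*",
        Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns each Content-* header of the message,
    // with folded continuation lines joined by a single space.
    public static List<String> extract(String message) {
        // The header section ends at the first empty line.
        String headerSection = message.split("\\r?\\n\\r?\\n", 2)[0];
        List<String> headers = new ArrayList<>();
        Matcher m = CONTENT_HEADER.matcher(headerSection);
        while (m.find()) {
            // Assumption: line break + leading whitespace becomes one space.
            headers.add(m.group().replaceAll("\\r?\\n[ \\t]+", " "));
        }
        return headers;
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain;\r\n"
            + "    charset=\"us-ascii\"\r\n"
            + "Content-Transfer-Encoding: 7bit\r\n"
            + "\r\n"
            + "Lorem ipsum dolor sit amet\r\n";
        for (String header : extract(mail)) {
            System.out.println(header);
        }
    }
}
```

Cutting off the header section first keeps a body line that happens to start with "Content-" from being picked up as a header.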

BTW: Why not just use JavaMail instead of reinventing the wheel?

Upvotes: 0

Narendra Yadala

Reputation: 9664

You can try this regex:

Pattern regex = Pattern.compile("Content-Type.*?(?=^\\s*\n?\r?$)", 
                                 Pattern.DOTALL | Pattern.MULTILINE);

Upvotes: 1

hllau

Reputation: 10569

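^Content-(.|\n)*?\n\n This will match up to and including the first blank line, assuming \n line endings. The reluctant *? matters: a greedy * would run on to the last blank line in the message.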
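
To see how much the quantifier matters for a pattern of this shape, here is a small sketch that compares a greedy and a reluctant version on a message whose body also contains a blank line. The class name QuantifierDemo, the helper firstMatch and the sample text are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String mail = "Content-Type: text/plain\n"
            + "Content-Transfer-Encoding: 7bit\n"
            + "\n"
            + "First paragraph.\n"
            + "\n"
            + "Second paragraph.\n";

        // Greedy: consumes everything, then backtracks to the LAST blank line.
        String greedy = firstMatch("^Content-(.|\\n)*\\n\\n", mail);
        // Reluctant: grows one character at a time, so it stops at the FIRST.
        String lazy = firstMatch("^Content-(.|\\n)*?\\n\\n", mail);

        System.out.println(greedy.endsWith("First paragraph.\n\n")); // true
        System.out.println(lazy.endsWith("7bit\n\n"));               // true
    }
}
```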

Upvotes: 1

FailedDev

Reputation: 26940

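Pattern regex = Pattern.compile("^Content-Type(?:.|\\s)*?(?=\\r?\\n[ \\t]*\\r?\\n)");

This will match everything from Content-Type up to, but not including, the first empty or whitespace-only line, for both \r\n and \n line endings. Without Pattern.MULTILINE, ^ only matches at the very start of the input.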

Upvotes: 2
