S.M_Emamian

Reputation: 17383

Extract all images from an HTML string

My web service returns an HTML string like the one below:

 {"content":"[caption id=\"attachment_7691\" align=\"aligncenter\" width=\"300\"]<img class=\"wp-image-7691 size-medium\" src=\"http:\/\/smsbaz.org\/wp-content\/uploads\/2015\/07\/funny-sms-exams-300x217.jpg\" alt=\"funny sms exams\" width=\"300\" height=\"217\" \/> funny sms exams[\/caption]\r\n<p style=\"text-align: center\">\u062f\u0631\u0633 \u062e\u0648\u0627\u0646\u062f\u0646 \u0686\u06cc\u0633\u062a\u061f\r\n.\r\n.\r\n.\r\n\u0628\u0647\u062a\u0631\u06cc\u0646 \u0642\u0631\u0635 \u062e\u0648...

I would like to extract all images, like this one:

[image: sms (source: smsbaz.org)]

I'm using this function, but the size of the returned array is always 0:

public ArrayList<String> getImagesOfFromHtmlString(String str) {
    ArrayList<String> arr_images = new ArrayList<>();
    Pattern pattern = Pattern.compile("(https?://\\s*\\S+\\.(?:jpg|JPEG|png|gif))");
    Matcher m = pattern.matcher(str);

    while (m.find()) {
        arr_images.add(m.group());
    }

    return arr_images;
}

Where is my mistake?

Upvotes: 1

Views: 199

Answers (1)

maraca

Reputation: 8743

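This is a little bit dangerous, because the HTML could also contain relative URLs. Anyway, there seems to be a problem with your character classes: for example, \s matches whitespace, which a URL should not contain. I also noticed that you call group(). In that case you don't need the capturing group, because in your code group() returns the same as group(1). With the pattern below the two differ: the match also includes the src= prefix and the quotes, so the URL itself is in group(1). Here is a solution, not perfect, but good enough to extract the URLs: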

"src=[\"'](https?://[^\"']+?\\.(?:jpg|JPEG|png|gif))['\"]"
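To show how that pattern could be plugged into the method from the question, here is a minimal sketch. The class name ImageExtractor and the method name extractImageUrls are made up for illustration, and the sketch assumes the JSON has already been decoded, so that the HTML contains plain / characters rather than the \/ escapes visible in the raw response.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original post.
class ImageExtractor {

    // The pattern from the answer: an absolute http(s) URL inside a quoted src attribute.
    private static final Pattern IMG_SRC = Pattern.compile(
            "src=[\"'](https?://[^\"']+?\\.(?:jpg|JPEG|png|gif))['\"]");

    // Assumes html is already decoded, i.e. it contains "/" and not the JSON escape "\/".
    static List<String> extractImageUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // group(1) is the URL alone; group() would also include src=" and the closing quote.
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Shortened, decoded version of the "content" value from the question.
        String html = "<img class=\"wp-image-7691 size-medium\" "
                + "src=\"http://smsbaz.org/wp-content/uploads/2015/07/funny-sms-exams-300x217.jpg\" "
                + "alt=\"funny sms exams\" width=\"300\" height=\"217\" />";
        // prints [http://smsbaz.org/wp-content/uploads/2015/07/funny-sms-exams-300x217.jpg]
        System.out.println(extractImageUrls(html));
    }
}
```

If the raw JSON text is passed in instead of the decoded "content" value, the slashes are still escaped as \/ and https?:// cannot match, so the list stays empty. Relative URLs and file extensions outside the listed four are skipped as well.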

Upvotes: 1
