Vision framework gives incorrect coordinates for rectangles in frames from captured video output

Question

I need to recognize rectangles in frames from a captured video. I use the following method to display a rectangle on top of an observed image.

func displayRect(for observation: VNRectangleObservation) {
    DispatchQueue.main.async { [weak self] in
        guard let size = self?.imageView.frame.size else { return }
        guard let origin = self?.imageView.frame.origin else { return }

        let transform = CGAffineTransform(scaleX: size.width, y: size.height)

        let rect = observation.boundingBox.applying(transform)
            .applying(CGAffineTransform(scaleX: 1.0, y: -1.0))
            .applying(CGAffineTransform(translationX: 0.0, y: size.height))
            .applying(CGAffineTransform(translationX: -origin.x, y: -origin.y))

        let path = UIBezierPath(rect: rect)

        let layer = CAShapeLayer()
        layer.path = path.cgPath
        layer.fillRule = .evenOdd // `kCAFillRuleEvenOdd` before Swift 4.2
        layer.fillColor = UIColor.red.withAlphaComponent(0.2).cgColor

        self?.overlay.sublayers = nil
        self?.overlay.addSublayer(layer)
    }
}

This works just fine with images taken from the camera, but for frames from captured video the rectangle is off. In fact, it looks like the rectangle (and thus the entire coordinate system for the image) is off by 90 degrees. Please see the screenshots below.

Am I missing something about video frames that could cause the observation's boundingBox property to be in an entirely different coordinate system?

Below is my implementation of the captureOutput delegate method.

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    guard let buffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    // Also tried converting to CGImage, creating handler from that, but made no difference
    let handler = VNImageRequestHandler(cvPixelBuffer: buffer, options: [:])

    let request = VNDetectRectanglesRequest()
    request.minimumAspectRatio = VNAspectRatio(0.2)
    request.maximumAspectRatio = VNAspectRatio(1.0)
    request.minimumSize = Float(0.3)

    try? handler.perform([request])

    // Note: Only ever captures one rectangle, so calling `first` not the issue.
    guard let observations = request.results as? [VNRectangleObservation],
        let observation = observations.first else {
            return removeShapeLayer()
    }

    displayRect(for: observation)
}

Craig Siemens · Accepted Answer

The issue is that you're not passing the orientation of the buffer to the VNImageRequestHandler, so it is treating the video as landscape. Then, when it returns that rect, you place it on top of the video, which is being displayed in portrait.

You'll either need to pass the orientation to the VNImageRequestHandler, or modify (rotate) the rectangle returned to take that into account.
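
As a rough sketch of the first option, the handler can be created with the initializer that takes an orientation. The class name `RectangleDetector` and the property `bufferOrientation` are hypothetical and only there for illustration. The value `.right` assumes the back camera, a device held in portrait, and a capture connection left at its default orientation; other setups will likely need a different value.

```swift
import AVFoundation
import ImageIO
import Vision

// Hypothetical delegate class, for illustration only.
final class RectangleDetector: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {

    // Assumption: back camera, portrait device, default connection orientation.
    // The front camera or other device orientations need a different value.
    var bufferOrientation: CGImagePropertyOrientation = .right

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let buffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        // Pass the orientation so Vision knows how the pixel data is rotated
        // relative to the image as it is displayed.
        let handler = VNImageRequestHandler(cvPixelBuffer: buffer,
                                            orientation: bufferOrientation,
                                            options: [:])

        let request = VNDetectRectanglesRequest()
        do {
            try handler.perform([request])
        } catch {
            print("Rectangle detection failed: \(error)")
            return
        }

        // With the orientation supplied, boundingBox is normalized relative
        // to the oriented (upright) image rather than the raw landscape buffer.
        guard let observation = (request.results as? [VNRectangleObservation])?.first else { return }
        print(observation.boundingBox)
    }
}
```

With this in place, the existing conversion in displayRect (scale by the view size, then flip the y axis) should not need an extra rotation step, provided the orientation value matches how the preview is shown.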
