Basel

Reputation: 700

Japanese vertical text recognition with VNRecognizeTextRequest not working

I'm using the Apple OCR capabilities provided by the Vision Framework to recognize text in images. While I've had great success with horizontal text in Japanese, Korean, and Chinese, I'm encountering issues with vertical text.

Problem: When trying to recognize vertical text in these languages, the OCR returns nil.

What I've Tried:

Example images:

[two example images of vertical Japanese text]

Code Snippet:

func ocr() {
    guard let image = UIImage(named: imageName) else {
        print("Failed to load image")
        return
    }
    
    guard let cgImage = image.cgImage else {
        print("Failed to get CGImage from UIImage")
        return
    }
    
    // Request handler
    let handler = VNImageRequestHandler(cgImage: cgImage, orientation: .right, options: [:])
    
    let recognizeRequest = VNRecognizeTextRequest { (request, error) in
        if let error = error {
            print("Failed to recognize text: \(error.localizedDescription)")
            return
        }
        
        // Parse the results as text
        guard let result = request.results as? [VNRecognizedTextObservation] else {
            print("No text found")
            return
        }
        
        let stringArray = result.compactMap { result in
            result.topCandidates(1).first?.string
        }
        
        let recognizedString = stringArray.joined(separator: "\n")
        
        let singleLineText = recognizedString
            .components(separatedBy: .newlines)
            .joined(separator: " ")
        
        DispatchQueue.main.async {
            self.recognizeText = singleLineText
        }
    }
    
    recognizeRequest.recognitionLanguages = ["ja"]
    recognizeRequest.revision = VNRecognizeTextRequestRevision3
    recognizeRequest.automaticallyDetectsLanguage = true
    recognizeRequest.recognitionLevel = .accurate
    recognizeRequest.usesLanguageCorrection = false
    
    do {
        try handler.perform([recognizeRequest])
    } catch {
        print("Failed to perform text recognition: \(error.localizedDescription)")
    }
}

Upvotes: 3

Views: 811

Answers (2)

Basel

Reputation: 700

After trying Apple Vision for two weeks, I concluded that it does not support vertical text directly. I therefore looked for alternative solutions and found that the Tesseract OCR library, a well-established open-source engine (originally developed at HP and later maintained by Google), could address this issue. Specifically, the Tesseract repository provides a trained model for vertical Japanese text (jpn_vert.traineddata).

For iOS, I used the SwiftyTesseract library, which is more modern and worked well for my needs. Below are the steps I followed to get it up and running:

Steps:

  1. Install SwiftyTesseract: Add SwiftyTesseract to your project using Swift Package Manager.
  2. Import SwiftyTesseract
  3. Download jpn_vert.traineddata from here
  4. Add the trained data to your project:
  • Create a folder named tessdata.

  • Add jpn_vert.traineddata to this folder.

  • Drag the tessdata folder to your Xcode project and select Create folder references.

  • In Edit Scheme, under Run, add an Environment Variable with name TESSDATA_PREFIX and value $(PROJECT_DIR)/tessdata.
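If you manage dependencies through a Package.swift rather than Xcode's UI, the dependency declaration looks roughly like the sketch below. The repository URL, version, and target name here are assumptions; check the SwiftyTesseract README for the current coordinates.

// swift-tools-version:5.7
// Minimal Package.swift sketch -- the URL, version, and names are assumptions, not verified coordinates.
import PackageDescription

let package = Package(
    name: "VerticalOCRDemo",
    platforms: [.iOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/SwiftyTesseract/SwiftyTesseract.git", from: "4.0.0")
    ],
    targets: [
        .target(
            name: "VerticalOCRDemo",
            dependencies: [
                .product(name: "SwiftyTesseract", package: "SwiftyTesseract")
            ]
        )
    ]
)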

Add this extension:

public typealias PageSegmentationMode = TessPageSegMode

public extension PageSegmentationMode {
  static let osdOnly = PSM_OSD_ONLY
  static let autoOsd = PSM_AUTO_OSD
  static let autoOnly = PSM_AUTO_ONLY
  static let auto = PSM_AUTO
  static let singleColumn = PSM_SINGLE_COLUMN
  static let singleBlockVerticalText = PSM_SINGLE_BLOCK_VERT_TEXT
  static let singleBlock = PSM_SINGLE_BLOCK
  static let singleLine = PSM_SINGLE_LINE
  static let singleWord = PSM_SINGLE_WORD
  static let circleWord = PSM_CIRCLE_WORD
  static let singleCharacter = PSM_SINGLE_CHAR
  static let sparseText = PSM_SPARSE_TEXT
  static let sparseTextOsd = PSM_SPARSE_TEXT_OSD
  static let count = PSM_COUNT
}

public extension Tesseract {
  var pageSegmentationMode: PageSegmentationMode {
    get {
      perform { tessPointer in
        TessBaseAPIGetPageSegMode(tessPointer)
      }
    }
    set {
      perform { tessPointer in
        TessBaseAPISetPageSegMode(tessPointer, newValue)
      }
    }
  }
}

Usage:

func japaneseOCR() {
    let tesseract = Tesseract(languages: [.custom("jpn_vert")])
    tesseract.pageSegmentationMode = .singleBlockVerticalText
    
    guard let image = UIImage(named: imageName) else {
        print("Failed to load image")
        return
    }
    
    guard let imageData = image.jpegData(compressionQuality: 1.0) else {
        print("Failed to load imageData")
        return
    }
    
    let result: Result<String, Tesseract.Error> = tesseract.performOCR(on: imageData)
    self.recognizeText = (try? result.get()) ?? ""
}

Result:

[screenshots of the recognized text output]

Upvotes: 3

Rethunk

Reputation: 4113

To read Japanese characters aligned vertically, there's a hackish solution. Once I have a Swift implementation I'll post it, but in the meantime the step-by-step description I provide below may be sufficient for you to make progress.

But first, for quick comparison, check the performance of Google Vision API, which you can try for free: https://cloud.google.com/vision/docs/drag-and-drop

Sample output using Google Vision API

Try uploading images of individual columns to Google Vision API to see if performance improves. Alternatively, try modifying the image to provide more whitespace between adjacent columns of characters.

Using the Vision framework, though, it's clear as of June 2024 that VNRecognizeTextRequest won't read a vertical column of characters. Here are some observations, based on your images:

VNRecognizeTextRequest works fine for 2+ characters aligned horizontally.

For example, both of the following will read:

characters rearranged horizontally

The characters also read if the order is reversed:

characters rearranged horizontally in reverse order

When a single character is presented, it will not read. (This is similar to Vision's difficulty reading individual Arabic numerals, which appears to be partly "fixed" in the latest version of VNRecognizeTextRequest).

single character

However, a character can be rendered readable by creating an image in which the character is duplicated. (This also works for Latin script.)

[image: the same character duplicated side by side]

You should be able to reproduce these results with single characters and duplicated characters using your favorite image editor. Take notes as you edit the images, because the hack I propose will follow similar steps.

In short, with a little help from some image processing algorithms that (to my knowledge) aren't available in any of Apple's libraries, we're going to chop up images with vertical columns of characters to create new images with horizontal rows of the same characters. Then Vision will read the new images just fine.

Here are the basic steps:

  1. From the image of size (width, height), create a new image of size (height, width).
  2. Identify a bounding box for each character. (Described in more detail below).
  3. Identify the number of columns, and the characters belonging to each column.
  4. Copy & paste the rectangular subimage for each character from the original image to the new image, but rather than traversing a column (in the image Y direction) you'll be pasting the characters end to end in a row (in the image X direction).
  5. Run your OCR code on the artificially generated image.

More details.

1. Create a new image. A CGImage is likely satisfactory here. It's reasonably easy to work with CGImages, but for performant image processing it can help to work with the image data at a lower level.
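As a rough illustration of step 1, a blank canvas with swapped dimensions could be created like this (makeBlankCanvas is a hypothetical helper name of mine, not an Apple API):

import CoreGraphics

// Sketch: create a blank RGBA bitmap context whose width/height are swapped
// relative to the source image. `makeBlankCanvas` is a hypothetical helper name.
func makeBlankCanvas(for source: CGImage) -> CGContext? {
    let context = CGContext(
        data: nil,
        width: source.height,      // swapped: new width = old height
        height: source.width,      // swapped: new height = old width
        bitsPerComponent: 8,
        bytesPerRow: 0,            // let Core Graphics pick the stride
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    )
    // Fill with white so pasted glyphs land on a clean background.
    context?.setFillColor(CGColor(red: 1, green: 1, blue: 1, alpha: 1))
    context?.fill(CGRect(x: 0, y: 0, width: source.height, height: source.width))
    return context
}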

2. Identify a bounding box for each character. Approach this step with a simple technique, then iterate to improve robustness. For initial implementations I would strongly recommend techniques you could implement yourself, or use image processing functions you can understand with just a little bit of study.

As a first implementation, you can try either connected-component labeling or flood fill.

https://en.wikipedia.org/wiki/Connected-component_labeling

https://en.wikipedia.org/wiki/Flood_fill

Very loosely, you can think of connected-component labeling as flood fill, but with the filled regions having numbers as labels: region 1, region 2, region 3, and so on.

Once all the dark pixels of a character are identified as belonging to a single "component" (or blob), then you can easily find the bounds of that blob. It's common to find the rectangular bounds--min x, min y, max x, max y--but one can also find the "tight" bounds as a convex hull or related shape.

https://en.wikipedia.org/wiki/Convex_hull
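To make the idea concrete, here's a toy flood-fill sketch that finds rectangular bounds for each dark blob. It operates on a hypothetical binarized grid (isInk, true = dark pixel) rather than a real image buffer, and the Blob type and names are mine, for illustration only:

// Toy sketch: flood-fill connected components of dark pixels and record their bounds.
// `isInk` is a hypothetical [[Bool]] binarized image; all names here are made up for illustration.
struct Blob { var minX: Int, minY: Int, maxX: Int, maxY: Int }

func characterBounds(isInk: [[Bool]]) -> [Blob] {
    let height = isInk.count
    let width = height > 0 ? isInk[0].count : 0
    var visited = Array(repeating: Array(repeating: false, count: width), count: height)
    var blobs: [Blob] = []

    for y in 0..<height {
        for x in 0..<width where isInk[y][x] && !visited[y][x] {
            // Flood fill one component with an explicit stack (avoids deep recursion).
            var blob = Blob(minX: x, minY: y, maxX: x, maxY: y)
            var stack = [(x, y)]
            visited[y][x] = true
            while let (cx, cy) = stack.popLast() {
                blob.minX = min(blob.minX, cx); blob.maxX = max(blob.maxX, cx)
                blob.minY = min(blob.minY, cy); blob.maxY = max(blob.maxY, cy)
                for (nx, ny) in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
                where nx >= 0 && nx < width && ny >= 0 && ny < height
                      && isInk[ny][nx] && !visited[ny][nx] {
                    visited[ny][nx] = true
                    stack.append((nx, ny))
                }
            }
            blobs.append(blob)
        }
    }
    return blobs
}

In practice you'd also filter out tiny blobs (noise) and merge boxes that overlap heavily, which helps with the radical problem mentioned in the caveat below.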

The open-source library OpenCV has a function called findContours() that works well; it's a traditional algorithm with a satisfactory implementation. Under the covers there are two different techniques: a traditional "multipass" algorithm and a newer "single pass" algorithm. The single-pass algorithm is based on a paper by some Japanese researchers--I can dig up the paper if you'd like.

Google "opencv swift tutorial" to find instructions about integrating the OpenCV library into your project. Make sure you branch your code properly before trying this!

My own implementation of the single pass algorithm in Swift is kinda usable, but it's meant for prototyping tests rather than production use.

To be clear: I'm not saying this is the "right" solution to segmenting characters from the background, but it's a fairly simple technique that you can implement right now. There are better techniques that take much more explaining.

Caveat: I'm more familiar with Chinese characters, but with Japanese characters I would also expect that some radicals could present a problem. If two radicals belong together, but aren't connected by a common stroke--or if a stroke has faded a lot--then you'll have two bounding boxes.

https://laits.utexas.edu/japanese/joshu/kanji/kanji_radicals/radicals2.html

For now I hope it's clear enough that if you can identify the rectangle of pixels in which a character is found, then you can copy that rectangle of pixels from the original image to the new image with horizontally aligned characters.

3. Identify the columns of characters. It'd be handy to just prompt the user to select vertical OCR or horizontal OCR. Simple!

If there is sufficient separation between columns, then it's not too hard to identify characters that belong to the same column, even if that column is slightly angled relative to the vertical (Y) axis of the image.

If the vertical separation between bounding boxes for characters in the same column isn't much smaller than the horizontal separation between columns, then you may have some difficulty robustly determining whether the characters are arranged in rows or in columns.

If your code determines that the vertical and horizontal spacing between characters is similar, making it hard to differentiate between vertical and horizontal alignment, then you could try running OCR on both orientations and picking a "winner." Voting schemes were a fairly common method of improving OCR read accuracy, and may still be in use in some libraries.
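Here's a rough sketch of that grouping step, reusing the hypothetical Blob type from the earlier sketch. The center-gap threshold is arbitrary and would need tuning for real images:

// Sketch: group character boxes into vertical columns by horizontal center,
// then sort each column top to bottom. The gap threshold is arbitrary.
func groupIntoColumns(_ blobs: [Blob], maxCenterGap: Int = 20) -> [[Blob]] {
    // Sort by horizontal center so boxes in the same column end up adjacent.
    let sorted = blobs.sorted { ($0.minX + $0.maxX) < ($1.minX + $1.maxX) }
    var columns: [[Blob]] = []

    for blob in sorted {
        let center = (blob.minX + blob.maxX) / 2
        if let previous = columns.last?.last,
           abs(center - (previous.minX + previous.maxX) / 2) <= maxCenterGap {
            columns[columns.count - 1].append(blob)   // same column
        } else {
            columns.append([blob])                    // start a new column
        }
    }
    // Within a column, characters read top to bottom. Note that vertical Japanese
    // is read right to left across columns, so you may want to reverse the column order.
    return columns.map { $0.sorted { $0.minY < $1.minY } }
}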

4. Copy & paste the character pixels to the new image. For CGImage this is straightforward enough.

If you're working with OpenCV, meaning that you're using the cv::Mat type to represent image data, the technique for defining a subimage is straightforward, but a bit different.
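For the Core Graphics route, a sketch of the copy-and-paste step might look like the following, reusing the hypothetical Blob type and the swapped-dimension canvas from step 1. Watch the coordinate systems: a top-left pixel grid and CGContext's bottom-left origin don't match, so real code needs an extra flip or offset.

import CoreGraphics

// Sketch: crop each character box out of the source image and draw the crops
// left to right into the destination context. All helper names are hypothetical.
func pasteRow(of column: [Blob], from source: CGImage, into context: CGContext, padding: Int = 8) {
    var cursorX = padding
    for blob in column {
        let charRect = CGRect(x: blob.minX, y: blob.minY,
                              width: blob.maxX - blob.minX + 1,
                              height: blob.maxY - blob.minY + 1)
        guard let glyph = source.cropping(to: charRect) else { continue }
        let destination = CGRect(x: cursorX, y: padding, width: glyph.width, height: glyph.height)
        context.draw(glyph, in: destination)
        cursorX += glyph.width + padding
    }
}

Calling context.makeImage() afterwards gives you a CGImage you can hand to VNImageRequestHandler.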

5. Run OCR on the image of horizontally aligned characters. Using your code, OCR worked fine when I created rows of characters using an image editor to manipulate your original images of characters arranged in columns.

Other code: As I mentioned above, VNRecognizeTextRequestRevision3 appears not to allow reading of individual Japanese characters, at least not with the parameter settings I tested. So if your code finds the bounding box for a single character, you could create an image with two copies of that character, confirm that the doubled character reads, then use the OCR result for just one character.
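A sketch of that doubling trick, again with Core Graphics (the helper name and padding are mine):

import CoreGraphics

// Sketch: draw the same glyph twice, side by side, on a white background.
// Run OCR on the result and keep only one character of the recognized string.
func doubledGlyphImage(from glyph: CGImage, padding: Int = 8) -> CGImage? {
    let width = glyph.width * 2 + padding * 3
    let height = glyph.height + padding * 2
    guard let context = CGContext(
        data: nil, width: width, height: height,
        bitsPerComponent: 8, bytesPerRow: 0,
        space: CGColorSpaceCreateDeviceRGB(),
        bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
    ) else { return nil }
    context.setFillColor(CGColor(red: 1, green: 1, blue: 1, alpha: 1))
    context.fill(CGRect(x: 0, y: 0, width: width, height: height))
    context.draw(glyph, in: CGRect(x: padding, y: padding, width: glyph.width, height: glyph.height))
    context.draw(glyph, in: CGRect(x: glyph.width + padding * 2, y: padding, width: glyph.width, height: glyph.height))
    return context.makeImage()
}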

Nowadays, OCR libraries that can read whole pages of text will include robustness checks and autocorrection to improve accuracy of individual words (or, I assume, Japanese characters) based on the context in which the character is found.


As I have time I'll edit this post to provide more implementation details. I didn't want to hold you up in case the description above was sufficient for you. Sorry, I'm super busy this week.

I would expect that some pretrained model or OCR library handles vertically aligned Japanese characters well. If you read textbooks about OCR, you'll find that a lot of Japanese and Chinese researchers are cited. Perhaps those papers will lead you to a third-party library that does a great job of on-device OCR for Japanese; you may need someone who knows Japanese to help with the install instructions, in case the academic paper is in English but the install instructions are written only in Japanese.

But we must make certain concessions to the brevity of human life, and I think you and I would rather have a hack now that we could conceivably replace later. Hurray, technical debt!

Upvotes: 1
