---
id: sm-vision-framework
name: "vision-framework"
url: https://skills.yangsir.net/skill/sm-vision-framework
author: dpearson2699
domain: ai-app-building-integration
tags: ["apple-vision-framework", "computer-vision", "ios-ai", "image-analysis"]
install_count: 1600
rating: 4.30 (37 reviews)
github: https://github.com/dpearson2699/swift-ios-skills
---

# vision-framework

> 实现移动应用中的计算机视觉功能，包括文本识别、图像分析等，增强用户交互体验。

**Stats**: 1,600 installs · 4.3/5 (37 reviews)

## Before / After 对比

### 移动端计算机视觉功能实现效率对比

## Readme

# Vision Framework

Detect text, faces, barcodes, objects, and body poses in images and video using
on-device computer vision. Patterns target iOS 26+ with Swift 6.2,
backward-compatible where noted.

See `references/vision-requests.md` for complete code patterns and
`references/visionkit-scanner.md` for DataScannerViewController integration.

## Contents

- [Two API Generations](#two-api-generations)
- [Request Pattern (Modern API)](#request-pattern-modern-api)
- [Text Recognition (OCR)](#text-recognition-ocr)
- [Face Detection](#face-detection)
- [Barcode Detection](#barcode-detection)
- [Document Scanning (iOS 26+)](#document-scanning-ios-26)
- [Image Segmentation](#image-segmentation)
- [Object Tracking](#object-tracking)
- [Other Request Types](#other-request-types)
- [Core ML Integration](#core-ml-integration)
- [VisionKit: DataScannerViewController](#visionkit-datascannerviewcontroller)
- [Common Mistakes](#common-mistakes)
- [Review Checklist](#review-checklist)
- [References](#references)

## Two API Generations

Vision has two distinct API layers. Prefer the modern API for new code.

| Aspect | Modern (iOS 18+) | Legacy |
|---|---|---|
| Pattern | `let result = try await request.perform(on: image)` | `VNImageRequestHandler` + completion handler |
| Request types | Swift types — structs and classes (`RecognizeTextRequest`, `DetectFaceRectanglesRequest`) | ObjC classes (`VNRecognizeTextRequest`, `VNDetectFaceRectanglesRequest`) |
| Concurrency | Native async/await | Completion handlers or synchronous `perform` |
| Observations | Typed return values | Cast `results` from `[Any]` |
| Availability | iOS 18+ / macOS 15+ | iOS 11+ |

The modern API uses the `ImageProcessingRequest` protocol. Each request type
has a `perform(on:orientation:)` method that accepts `CGImage`, `CIImage`,
`CVPixelBuffer`, `CMSampleBuffer`, `Data`, or `URL`. Most requests are
structs; stateful requests for video tracking (e.g., `TrackObjectRequest`,
`TrackRectangleRequest`, `DetectTrajectoriesRequest`) are final classes.

## Request Pattern (Modern API)

All modern Vision requests follow the same pattern: create a request struct,
call `perform(on:)`, and handle the typed result.

```swift
import Vision

func recognizeText(in image: CGImage) async throws -> [String] {
    var request = RecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [Locale.Language(identifier: "en-US")]

    let observations = try await request.perform(on: image)
    return observations.compactMap { observation in
        observation.topCandidates(1).first?.string
    }
}
```

### Legacy Pattern (Pre-iOS 18)

Use `VNImageRequestHandler` with completion-based requests when targeting
older deployment versions.

```swift
import Vision

func recognizeTextLegacy(in image: CGImage) throws -> [String] {
    var recognized: [String] = []
    let request = VNRecognizeTextRequest { request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        recognized = observations.compactMap { $0.topCandidates(1).first?.string }
    }
    request.recognitionLevel = .accurate

    let handler = VNImageRequestHandler(cgImage: image)
    try handler.perform([request])
    return recognized
}
```

## Text Recognition (OCR)

### Modern: RecognizeTextRequest (iOS 18+)

```swift
var request = RecognizeTextRequest()
request.recognitionLevel = .accurate       // .fast for real-time
request.recognitionLanguages = [
    Locale.Language(identifier: "en-US"),
    Locale.Language(identifier: "fr-FR"),
]
request.usesLanguageCorrection = true
request.customWords = ["SwiftUI", "Xcode"] // domain-specific terms

let observations = try await request.perform(on: cgImage)
for observation in observations {
    guard let candidate = observation.topCandidates(1).first else { continue }
    let text = candidate.string
    let confidence = candidate.confidence  // 0.0 ... 1.0
    let bounds = observation.boundingBox   // normalized coordinates
}
```

### Legacy: VNRecognizeTextRequest

```swift
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
request.recognitionLanguages = ["en-US", "fr-FR"]
request.usesLanguageCorrection = true
```

**Key differences:** Modern API uses `Locale.Language` for languages; legacy
uses string identifiers. Both support `.accurate` (best quality) and `.fast`
(real-time suitable) recognition levels.

## Face Detection

Detect face rectangles, landmarks (eyes, nose, mouth), and capture quality.

```swift
// Modern API
let faceRequest = DetectFaceRectanglesRequest()
let faces = try await faceRequest.perform(on: cgImage)

for face in faces {
    let boundingBox = face.boundingBox   // normalized CGRect
    let roll = face.roll                 // Measurement<UnitAngle>
    let yaw = face.yaw                  // Measurement<UnitAngle>
}

// Landmarks (eyes, nose, mouth contours)
var landmarkRequest = DetectFaceLandmarksRequest()
let landmarkFaces = try await landmarkRequest.perform(on: cgImage)
for face in landmarkFaces {
    let landmarks = face.landmarks
    let leftEye = landmarks?.leftEye?.normalizedPoints
    let nose = landmarks?.nose?.normalizedPoints
}
```

### Coordinate System

Vision uses a normalized coordinate system with origin at the bottom-left.
Convert to UIKit (top-left origin) before display:

```swift
func convertToUIKit(_ rect: CGRect, imageHeight: CGFloat) -> CGRect {
    CGRect(
        x: rect.origin.x,
        y: imageHeight - rect.origin.y - rect.height,
        width: rect.width,
        height: rect.height
    )
}
```

## Barcode Detection

Detect 1D and 2D barcodes including QR codes.

```swift
var request = DetectBarcodesRequest()
request.symbologies = [.qr, .ean13, .code128, .pdf417]

let barcodes = try await request.perform(on: cgImage)
for barcode in barcodes {
    let payload = barcode.payloadString          // decoded content
    let symbology = barcode.symbology            // .qr, .ean13, etc.
    let bounds = barcode.boundingBox             // normalized rect
}
```

Common symbologies: `.qr`, `.aztec`, `.pdf417`, `.dataMatrix`, `.ean8`,
`.ean13`, `.code39`, `.code128`, `.upce`, `.itf14`.

## Document Scanning (iOS 26+)

`RecognizeDocumentsRequest` provides structured document reading with layout
understanding beyond basic OCR. Returns `DocumentObservation` objects with a
nested `Container` structure for paragraphs, tables, lists, and barcodes.

```swift
var request = RecognizeDocumentsRequest()
let documents = try await request.perform(on: cgImage)

for observation in documents {
    let container = observation.document

    // Full text content
    let fullText = container.text

    // Structured access to paragraphs
    for paragraph in container.paragraphs {
        let paragraphText = paragraph.text
    }

    // Tables and lists
    for table in container.tables { /* structured table data */ }
    for list in container.lists { /* structured list data */ }

    // Embedded barcodes detected within the document
    for barcode in container.barcodes { /* barcode data */ }

    // Document title if detected
    if let title = container.title { print(title) }
}
```

For simpler document camera scanning, use VisionKit's
`VNDocumentCameraViewController` which provides a full-screen camera UI with
auto-capture, perspective correction, and multi-page scanning.

## Image Segmentation

### Modern: GeneratePersonSegmentationRequest (iOS 18+)

```swift
var request = GeneratePersonSegmentationRequest()
request.qualityLevel = .accurate  // .balanced, .fast

let mask = try await request.perform(on: cgImage)
// mask is a PersonSegmentationObservation with a pixelBuffer property
let maskBuffer = mask.pixelBuffer
// Apply mask using Core Image: CIFilter.blendWithMask()
```

### Legacy: VNGeneratePersonSegmentationRequest

```swift
let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .accurate  // .balanced, .fast
request.outputPixelFormat = kCVPixelFormatType_OneComponent8

let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([request])

guard let mask = request.results?.first?.pixelBuffer else { return }
// Apply mask using Core Image: CIFilter.blendWithMask()
```

Quality levels:
- `.accurate` -- best quality, slowest (~1s), full resolution
- `.balanced` -- good quality, moderate speed (~100ms), 960x540
- `.fast` -- lowest quality, fastest (~10ms), 256x144, suitable for real-time

### Instance Segmentation (iOS 18+)

Separate masks per person for individual effects.

```swift
// Modern API (iOS 18+)
let request = GeneratePersonInstanceMaskRequest()
let observation = try await request.perform(on: cgImage)
let indices = observation.allInstances

for index in indices {
    let mask = try observation.generateMask(forInstances: IndexSet(integer: index))
    // mask is a CVPixelBuffer with only this person visible
}
```

```swift
// Legacy API (iOS 17+)
let request = VNGeneratePersonInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([request])

guard let result = request.results?.first else { return }
let indices = result.allInstances
for index in indices {
    let instanceMask = try result.generateMaskedImage(
        ofInstances: IndexSet(integer: index),
        from: handler,
        croppedToInstancesExtent: false
    )
}
```

See `references/vision-requests.md` for mask composition and Core Image filter
integration patterns.

## Object Tracking

### Modern: TrackObjectRequest (iOS 18+)

`TrackObjectRequest` is a stateful request that maintains tracking context
across frames. Conforms to both `ImageProcessingRequest` and `StatefulRequest`.

```swift
// Initialize with a detected object's bounding box
let initialObservation = DetectedObjectObservation(boundingBox: detectedRect)
var request = TrackObjectRequest(observation: initialObservation)
request.trackingLevel = .accurate

// For each video frame:
let results = try await request.perform(on: pixelBuffer)
if let tracked = results.first {
    let updatedBounds = tracked.boundingBox
    let confidence = tracked.confidence
}
```

### Legacy: VNTrackObjectRequest

```swift
let trackRequest = VNTrackObjectRequest(detectedObjectObservation: initialObservation)
trackRequest.trackingLevel = .accurate

let sequenceHandler = VNSequenceRequestHandler()
// For each frame:
try sequenceHandler.perform([trackRequest], on: pixelBuffer)
if let result = trackRequest.results?.first {
    let updatedBounds = result.boundingBox
    trackRequest.inputObservation = result
}
```

## Other Request Types

Vision provides additional requests covered in `references/vision-requests.md`:

| Request | Purpose |
|---|---|
| `ClassifyImageRequest` | Classify scene content (outdoor, food, animal, etc.) |
| `GenerateAttentionBasedSaliencyImageRequest` | Heat map of where viewers focus attention |
| `GenerateObjectnessBasedSaliencyImageRequest` | Heat map of object-like regions |
| `GenerateForegroundInstanceMaskRequest` | Foreground object segmentation (not person-specific) |
| `DetectRectanglesRequest` | Detect rectangular shapes (documents, cards, screens) |
| `DetectHorizonRequest` | Detect horizon angle for auto-leveling photos |
| `DetectHumanBodyPoseRequest` | Detect body joints (shoulders, elbows, knees) |
| `DetectHumanBodyPose3DRequest` | 3D human body pose estimation |
| `DetectHumanHandPoseRequest` | Detect hand joints and finger positions |
| `DetectAnimalBodyPoseRequest` | Detect animal body joint positions |
| `DetectFaceCaptureQualityRequest` | Face capture quality scoring (0–1) for photo selection |
| `TrackRectangleRequest` | Track rectangular objects across video frames |
| `TrackOpticalFlowRequest` | Optical flow between video frames |
| `DetectTrajectoriesRequest` | Detect object trajectories in video |

All modern request types above are iOS 18+ / macOS 15+.

## Core ML Integration

Run custom Core ML models through Vision for automatic image preprocessing
(resizing, normalization, color space conversion).

```swift
// Modern API (iOS 18+)
let model = try MLModel(contentsOf: modelURL)
let request = CoreMLRequest(model: .init(model))
let results = try await request.perform(on: cgImage)

// Classification model
if let classification = results.first as? ClassificationObservation {
    let label = classification.identifier
    let confidence = classification.confidence
}
```

```swift
// Legacy API
let vnModel = try VNCoreMLModel(for: model)
let request = VNCoreMLRequest(model: vnModel) { request, error in
    guard let results = request.results as? [VNClassificationObservation] else { return }
    let topResult = results.first
}
let handler = VNImageRequestHandler(cgImage: cgImage)
try handler.perform([request])
```

For model conversion and optimization, see the `coreml` skill.

## VisionKit: DataScannerViewController

`DataScannerViewController` provides a full-screen live camera scanner for text
and barcodes. See `references/visionkit-scanner.md` for complete patterns.

### Quick Start

```swift
import VisionKit

// Check availability (requires A12+ chip and camera)
guard DataScannerViewController.isSupported,
      DataScannerViewController.isAvailable else { return }

let scanner = DataScannerViewController(
    recognizedDataTypes: [
        .text(languages: ["en"]),
        .barcode(symbologies: [.qr, .ean13])
    ],
    qualityLevel: .balanced,
    recognizesMultipleItems: true,
    isHighFrameRateTrackingEnabled: true,
    isHighlightingEnabled: true
)
scanner.delegate = self
present(scanner, animated: true) {
    try? scanner.startScanning()
}
```

### SwiftUI Integration

Wrap `DataScannerViewController` in `UIViewControllerRepresentable`. See
`references/visionkit-scanner.md` for the full implementation.

## Common Mistakes

**DON'T:** Use the legacy `VNImageRequestHandler` API for new iOS 18+ projects.
**DO:** Use modern struct-based requests with `perform(on:)` and async/await.
**Why:** Modern API provides type safety, better Swift concurrency support, and cleaner error handling.

**DON'T:** Forget to convert normalized coordinates before drawing bounding boxes.
**DO:** Use `VNImageRectForNormalizedRect(_:_:_:)` or manual conversion from bottom-left origin to UIKit top-left origin.
**Why:** Vision uses normalized coordinates (0...1) with bottom-left origin; UIKit uses points with top-left origin.

**DON'T:** Run Vision requests on the main thread.
**DO:** Perform requests on a background thread or use async/await from a detached task.
**Why:** Image analysis is CPU/GPU-intensive and blocks the UI if run on the main actor.

**DON'T:** Use `.accurate` recognition level for real-time camera feeds.
**DO:** Use `.fast` for live video, `.accurate` for still images or offline processing.
**Why:** Accurate recognition is too slow for 30fps video; fast recognition trades quality for speed.

**DON'T:** Ignore the `confidence` score on observations.
**DO:** Filter results by confidence threshold (e.g., > 0.5) appropriate for your use case.
**Why:** Low-confidence results are often incorrect and degrade user experience.

**DON'T:** Create a new `VNImageRequestHandler` for each frame when tracking objects.
**DO:** Use `VNSequenceRequestHandler` for video frame sequences.
**Why:** Sequence handler maintains temporal context for tracking; per-frame handlers lose state.

**DON'T:** Request all barcode symbologies when you only need QR codes.
**DO:** Specify only the symbologies you need in the request.
**Why:** Fewer symbologies means faster detection and fewer false positives.

**DON'T:** Assume `DataScannerViewController` is available on all devices.
**DO:** Check both `isSupported` (hardware) and `isAvailable` (user permissions) before presenting.
**Why:** Requires A12+ chip; `isAvailable` also checks camera access authorization.

## Review Checklist

- [ ] Uses modern Vision API (iOS 18+) unless targeting older deployments
- [ ] Vision requests run off the main thread (async/await or background queue)
- [ ] Normalized coordinates converted before UI display
- [ ] Confidence threshold applied to filter low-quality observations
- [ ] Recognition level matches use case (`.fast` for video, `.accurate` for stills)
- [ ] Language hints set for text recognition when input language is known
- [ ] Barcode symbologies limited to only those needed
- [ ] `DataScannerViewController` availability checked before presentation
- [ ] Camera usage description (`NSCameraUsageDescription`) in Info.plist for VisionKit
- [ ] Person segmentation quality level appropriate for use case
- [ ] `VNSequenceRequestHandler` used for video frame tracking (not per-frame handler)
- [ ] Error handling covers request failures and empty results

## References

- Vision request patterns: `references/vision-requests.md`
- VisionKit scanner integration: `references/visionkit-scanner.md`
- Apple docs: [Vision](https://sosumi.ai/documentation/vision) |
  [VisionKit](https://sosumi.ai/documentation/visionkit) |
  [RecognizeTextRequest](https://sosumi.ai/documentation/vision/recognizetextrequest) |
  [DataScannerViewController](https://sosumi.ai/documentation/visionkit/datascannerviewcontroller)


---
*Source: https://skills.yangsir.net/skill/sm-vision-framework*
*Markdown mirror: https://skills.yangsir.net/api/skill/sm-vision-framework/markdown*