May 17, 2021

BlazeFace in iOS using CoreML

In this post, we will learn how to implement a face detection model in an iOS app using CoreML.

You should read the TensorFlow Lite version of this post first, since the two versions share a lot of code and we will reuse several of its functions here.

The training and explanation of the model are in this post. You can check the code for this app in this repository.

From TensorFlow to CoreML

A crucial point is to create (or load) the TensorFlow model and compile it with a batch size of one. We can fix the batch size simply by running the model once on an array of shape:

1x128x128x3

If we use a different batch size, we may encounter problems when we load the model in the iOS app.
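We can trace the batch dimension with a dummy array of that shape. A minimal sketch using NumPy (the actual `model` call is commented out, since it assumes your Keras model is in scope):

```python
import numpy as np

# A single 128x128 RGB image: the batch dimension is fixed to one
dummy_input = np.zeros((1, 128, 128, 3), dtype=np.float32)

# Running the Keras model once on this array compiles it with batch size 1:
# model(dummy_input)

print(dummy_input.shape)  # (1, 128, 128, 3)
```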

Once we have the model compiled, we can convert it to CoreML. We need the coremltools library installed:

import coremltools as ct

# scale=1/255.0 normalizes 8-bit pixel values to the [0, 1] range
coreml_model = ct.convert(model, inputs=[ct.ImageType(scale=1/255.0)])

coreml_model.save("Face500.mlmodel")
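The scale=1/255.0 argument makes CoreML apply the normalization the model expects, mapping 8-bit pixel values into [0, 1] before inference. A quick check of what that scaling does (illustrative only):

```python
import numpy as np

# 8-bit pixel values at the extremes and the middle of the range
pixels = np.array([0, 127, 255], dtype=np.float32)

# The same normalization CoreML applies with ct.ImageType(scale=1/255.0)
scaled = pixels * (1 / 255.0)

print(scaled.min(), scaled.max())  # 0.0 1.0
```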

Interface

The interface is quite simple; we only need to show the camera view on the screen. The code for this is similar to the code we used in the TensorFlow Lite post:

struct CameraView: View {
    @StateObject var camera = CameraModel()
    
    var body: some View {
        ZStack {
            CameraPreview(camera: camera).ignoresSafeArea(.all, edges: .all)
            FrameView(boxes: camera.boxes)
            Rectangle().strokeBorder(Color.blue).frame(width: camera.maxWidth, height: camera.maxHeight)
            Text(String(camera.frames)).position(x: 100, y: 100)
        }.onAppear(perform: {
            camera.Check()
        })
    }
}

struct CameraPreview: UIViewRepresentable {
    @ObservedObject var camera: CameraModel
    
    func makeUIView(context: Context) -> UIView {
        let view = UIView(frame: UIScreen.main.bounds)
        camera.preview = AVCaptureVideoPreviewLayer(session: camera.session)
        camera.preview.frame = view.frame
        camera.preview.videoGravity = .resizeAspectFill
        view.layer.addSublayer(camera.preview)
        
        camera.session.startRunning()
        
        camera.maxWidth = camera.preview.bounds.size.width
        
        return view
    }
    
    func updateUIView(_ uiView: UIView, context: Context) {
    }
}

I recommend checking the whole code in the file ContentView.swift. Since we are using SwiftUI to build this app, there is a lot of code that could be improved.

Model Handler

The code that handles the model lives in the same file as the interface. This may not be best practice, but the code is short, so we keep them together.

Inside the CameraModel class we can find the following methods:

var classificationRequest: VNCoreMLRequest {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuOnly

    do {
        let model = try Face500(configuration: config)
        let visionModel = try VNCoreMLModel(for: model.model)
        
        let visionRequest = VNCoreMLRequest(model: visionModel) { request, error in
            
            guard let results = (request.results as? [VNCoreMLFeatureValueObservation]) else {
                fatalError("Unexpected result type from VNCoreMLRequest")
            }
            
            guard let predictions = results[0].featureValue.multiArrayValue else {
                fatalError("Result 0 is not a MultiArray")
            }
            
            var arrayPredictions: [Float] = []
            
            for i in 0..<predictions.count {
                arrayPredictions.append(predictions[i].floatValue)
            }
            
            guard let boxes = results[1].featureValue.multiArrayValue else {
                fatalError("Result 1 is not a MultiArray")
            }
            
            let finalBoxes = self.getFinalBoxes(boxes: boxes, arrayPredictions: arrayPredictions)
            
            DispatchQueue.main.async {
                self.boxes = finalBoxes
            }

        }
        
        visionRequest.imageCropAndScaleOption = .centerCrop
        return visionRequest
        
    } catch {
        fatalError("Failed to load ML model: \(error)")
    }
}

We create a configuration object using MLModelConfiguration. Its computeUnits option lets us choose between the GPU, the Neural Engine, or, as we do here, the CPU only.

When we add our CoreML model to Xcode, Xcode generates a class with the same name as the file; in this case both the file and the class are named Face500. We use VNCoreMLRequest to send a request to the model and get the results. These come back as VNCoreMLFeatureValueObservation objects, from which we can extract the arrays:

guard let predictions = results[0].featureValue.multiArrayValue else {
    fatalError("Result 0 is not a MultiArray")
}

We transform these arrays to the final boxes using the getFinalBoxes method:

func getFinalBoxes(boxes: MLMultiArray, arrayPredictions: [Float]) -> [BoxPrediction] {
    let arrays = getArrays(boxes: boxes)

    let finalBoxes: [BoxPrediction] = AMS.getFinalBoxes(
        rawXArray: arrays.xArray,
        rawYArray: arrays.yarray,
        rawWidthArray: arrays.width,
        rawHeightArray: arrays.height,
        classPredictions: arrayPredictions,
        imageWidth: Float(imageWidth),
        imageHeight: Float(imageHeight),
        cameraSize: cameraSizeRect)

    return finalBoxes
}

This method calls the AMS.getFinalBoxes function to finally get our filtered predicted boxes. You can learn more about this function in the face post.
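The details are in the face post, but roughly, getting the final boxes comes down to a confidence threshold followed by non-maximum suppression. A simplified Python sketch of that idea (hypothetical names and thresholds, not the actual AMS code):

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def final_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.3):
    """Score threshold + greedy non-maximum suppression."""
    candidates = sorted(
        (pair for pair in zip(boxes, scores) if pair[1] >= score_thresh),
        key=lambda pair: pair[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap a better box already kept
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

# Two overlapping detections of the same face plus one low-score box
boxes = [(10, 10, 50, 50), (12, 12, 50, 50), (200, 200, 40, 40)]
scores = [0.9, 0.8, 0.2]
print(final_boxes(boxes, scores))  # [((10, 10, 50, 50), 0.9)]
```

The low-score box is dropped by the threshold and the near-duplicate box is suppressed because it overlaps the strongest detection.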

The classificationRequest property only builds the request that we will use each time we want to run the model. To run the model we need the image from the camera:

func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    let pixelBuffer: CVPixelBuffer? = CMSampleBufferGetImageBuffer(sampleBuffer)
    
    guard let imagePixelBuffer = pixelBuffer else {
        return
    }
    
    imageWidth = CVPixelBufferGetWidth(imagePixelBuffer)
    imageHeight = CVPixelBufferGetHeight(imagePixelBuffer)
    
    runModel(bufferImage: imagePixelBuffer)
}

Each time we get a new frame from the camera, the captureOutput function fires; there we call the runModel function:

func runModel(bufferImage: CVPixelBuffer) {
    let startTime = CACurrentMediaTime()

    let handler = VNImageRequestHandler(cvPixelBuffer: bufferImage, orientation: .up)

    do {
        try handler.perform([classificationRequest])
    } catch {
        print("Failed to perform classification: \(error.localizedDescription)")
    }

    let finalTime = CACurrentMediaTime()

    let fullComputationFrames = 1 / (finalTime - startTime)

    DispatchQueue.main.async {
        self.frames = Int(fullComputationFrames)
    }

}

We receive a CVPixelBuffer from the camera and use it to create a VNImageRequestHandler, which we then use together with the classificationRequest to run our model.

Note that the completion handler defined inside classificationRequest is the one that processes the results.
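The frames counter shown on screen is just the inverse of the time one full request takes. With illustrative numbers:

```python
# Same arithmetic as in runModel: 1 / elapsed seconds = frames per second
start_time = 0.0
final_time = 0.040  # pretend the whole inference took 40 ms

full_computation_frames = 1 / (final_time - start_time)
print(int(full_computation_frames))  # 25
```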

Conclusions

This post is quite simple since we already know how the app works, so we focused on presenting the CoreML code.

CoreML is easier to use than TensorFlow Lite: CoreML automatically crops and resizes the input image, so we only have to care about the results.
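With the .centerCrop option, Vision takes the largest centered square of the camera frame before resizing it to the model's 128x128 input. A sketch of that geometry (my own illustration, not Vision's implementation):

```python
def center_crop_rect(width, height):
    """Largest centered square inside a width x height frame,
    returned as an (x, y, side, side) rectangle."""
    side = min(width, height)
    return ((width - side) // 2, (height - side) // 2, side, side)

# A 1080p landscape frame keeps its central 1080x1080 square
print(center_crop_rect(1920, 1080))  # (420, 0, 1080, 1080)
```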

In this example, CoreML is a little faster than TensorFlow Lite, but the difference is minimal, at least on the CPU. Our face detection model is already small; when we use the GPU or the Neural Engine, the computation is almost 20 frames per second slower. I suspect this is due to the model size: for such a small model, moving the data from the CPU to the GPU is not worth it. Perhaps on older devices the GPU would beat the CPU.
