Vicente Rodríguez

May 17, 2021

BlazeFace in iOS using TensorFlow Lite

In this post we will create an iOS app to detect faces using a deep learning model that we trained from scratch here.

With the help of TensorFlow Lite we can load and run models trained with TensorFlow on mobile devices. For an iOS app we also need to handle the device camera: grab the frames, feed them to the model, and display the results on the screen, in this case by drawing boxes around the detected faces.

In order to build the interface we will use SwiftUI, a relatively new way to create app interfaces with code instead of a visual template. Because the framework is so new, the implementation shown here may not be the most idiomatic way to build an app like this.

It is recommended to have some experience developing iOS apps, since we are not going to go through every step.

In this repository you can find all the code along with the TFLite model.

If you have Xcode installed you can open the FaceRecognitionAppTF.xcodeproj file to view the whole project. You also need an iPad or iPhone to install and test the app, since the simulator does not have access to a camera.

TF to TFLite

To convert the TensorFlow model to TFLite we can use the following code:


import tensorflow as tf

# save the trained model and load it into the TFLite converter
model.save('saved_model/face_model_500')

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/face_model_500')

# if we want to add optimization
# converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]

tflite_model = converter.convert()

# write the converted model to disk
with open('face_500.tflite', 'wb') as f:
    f.write(tflite_model)

In the Jupyter notebook from the Python version of this project you can also find code to test the TFLite model. This is useful when we add optimizations to the TFLite model and want to check how it performs.

Interface

The first step in developing this app is to build the user interface. The interface is quite simple, since we only need to show the camera preview on the screen.


struct CameraPreview: UIViewRepresentable {
    @ObservedObject var camera: CameraModel

    func makeUIView(context: Context) -> UIView {
        let view = UIView(frame: UIScreen.main.bounds)
        camera.preview = AVCaptureVideoPreviewLayer(session: camera.session)
        camera.preview.frame = view.frame
        camera.preview.videoGravity = .resizeAspectFill
        view.layer.addSublayer(camera.preview)

        camera.session.startRunning()

        camera.maxWidth = camera.preview.bounds.size.width

        return view
    }

    func updateUIView(_ uiView: UIView, context: Context) {
    }
}

The CameraPreview struct is used to build a UIView that works with SwiftUI. We use the AVCaptureVideoPreviewLayer class to get a preview layer for the camera session. There are a few relevant lines of code here; the first one:


camera.preview.videoGravity = .resizeAspectFill

Tells the preview how it should fill the screen. We have to remember that cameras usually record at 1920x1080 or 4K with a 16:9 aspect ratio, while the screen resolution and aspect ratio of each device are quite different. Because of this mismatch, some parts of the frame do not appear on the screen, even though the image the camera actually perceives contains more content. We will take this into account later to draw the boxes on the screen correctly.

We can also use .resizeAspect to see the whole frame that the camera perceives, but then the view on the screen will show black bars wherever the frame does not fill it completely.
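
To make the difference concrete, here is a small hypothetical helper (not part of the app) that estimates how much of the scaled frame overflows the view when .resizeAspectFill is used:

import CoreGraphics

// Hypothetical helper: given the frame size and the view size, compute how
// many points of the scaled frame overflow the view under .resizeAspectFill.
// The overflow is split between the two sides.
func aspectFillOverflow(frameSize: CGSize, viewSize: CGSize) -> CGSize {
    // .resizeAspectFill scales the frame so it covers the whole view,
    // which means using the larger of the two scale factors.
    let scale = max(viewSize.width / frameSize.width,
                    viewSize.height / frameSize.height)
    return CGSize(width: frameSize.width * scale - viewSize.width,
                  height: frameSize.height * scale - viewSize.height)
}

// A 1080x1920 portrait frame on a 390x844-point screen (the iPhone 12 values
// used in this post) overflows by roughly 84 points in width, about 42 points
// hidden on each side, and 0 in height.
let overflow = aspectFillOverflow(frameSize: CGSize(width: 1080, height: 1920),
                                  viewSize: CGSize(width: 390, height: 844))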

The second relevant line is:


camera.maxWidth = camera.preview.bounds.size.width

We save the width of the screen in the maxWidth variable so we can draw a square box showing which part of the image will be used as the input of the model. Our model takes a square image of size 128x128 as input, so we have to crop our 1920x1080 frames to 1080x1080. The model never sees the cropped-out areas; the square is a guide showing what the final cropped frame will be.
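
The crop itself happens later inside the model handler, but as a rough sketch (this exact helper is not in the repository), the centered square region of a portrait frame can be computed like this:

import CoreGraphics

// Hypothetical helper: the centered square crop of a frame.
func centeredSquareCrop(for frameSize: CGSize) -> CGRect {
    let side = min(frameSize.width, frameSize.height)
    return CGRect(x: (frameSize.width - side) / 2,
                  y: (frameSize.height - side) / 2,
                  width: side,
                  height: side)
}

// For a 1080x1920 frame this returns (x: 0, y: 420, width: 1080, height: 1080);
// that square is then resized to the 128x128 input the model expects.
let cropRect = centeredSquareCrop(for: CGSize(width: 1080, height: 1920))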

The final relevant line is:


@ObservedObject var camera: CameraModel

In SwiftUI we create observable objects that contain properties which may change; the views that observe these objects refresh only when a published property changes, instead of refreshing everything each time anything changes. The CameraModel object is the following one:


class CameraModel: NSObject, ObservableObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    @Published var isTaken = false
    @Published var session = AVCaptureSession()

    @Published var alert = false
    @Published var output = AVCaptureVideoDataOutput()

    @Published var preview: AVCaptureVideoPreviewLayer!

    var modelHandler: ModelHandlerTF? = ModelHandlerTF(modelFileInfo: (name: "face_500", extension: "tflite"))

    @Published var boxes: [BoxPrediction] = []
    @Published var frames: Int = 60
    @Published var myImage: UIImage = UIImage(named: "test.png")!

    var maxWidth: CGFloat = 0
    var maxHeight: CGFloat = 0
    var cameraSizeRect: CGRect = CGRect.zero

    // Ask for camera permission and configure the session once it is granted.
    func Check() {
        switch AVCaptureDevice.authorizationStatus(for: .video) {
        case .authorized:
            setUp()
            return
        case .notDetermined:
            AVCaptureDevice.requestAccess(for: .video) { (status) in
                if status {
                    self.setUp()
                }
            }
        case .denied:
            self.alert.toggle()
            return
        default:
            return
        }
    }

    // Make sure the TFLite model was loaded correctly.
    func CheckModel() {
        guard modelHandler != nil else {
            fatalError("Failed to load model")
        }
    }

    // Configure the capture session: front camera input and a BGRA video output.
    func setUp() {
        do {
            self.session.beginConfiguration()
            let device = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .front)

            let input = try AVCaptureDeviceInput(device: device!)

            if self.session.canAddInput(input) {
                self.session.addInput(input)
            }

            self.output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
            self.output.alwaysDiscardsLateVideoFrames = true
            self.output.videoSettings = [ String(kCVPixelBufferPixelFormatTypeKey) : kCMPixelFormat_32BGRA]

            if self.session.canAddOutput(self.output) {
                self.session.addOutput(self.output)
                self.output.connection(with: .video)?.videoOrientation = .portrait
            }

            self.session.commitConfiguration()

        } catch {
            print(error.localizedDescription)
        }

        cameraSizeRect = preview.layerRectConverted(fromMetadataOutputRect: CGRect(x: 0, y: 0, width: 1, height: 1))
    }

    // Called for every frame the camera delivers: run the model and publish the results.
    func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
        let pixelBuffer: CVPixelBuffer? = CMSampleBufferGetImageBuffer(sampleBuffer)

        guard let imagePixelBuffer = pixelBuffer else {
            return
        }

        let startTime = CACurrentMediaTime()

        let results = modelHandler?.runModel(onFrame: imagePixelBuffer, cameraSize: cameraSizeRect)
        let finalBoxes = results?.0 ?? [AverageMaximumSuppresion.emptyBox]

        let myResultImage = results!.1

        let finalTime = CACurrentMediaTime()

        let fullComputationFrames = 1 / (finalTime - startTime)

        DispatchQueue.main.async {
            self.boxes = finalBoxes
            self.frames = Int(fullComputationFrames)
            self.myImage = myResultImage
        }
    }
}

All the @Published variables in this object are observed values whose changes notify the views so they can update; here the boxes, frames and myImage variables change every time the model predicts new boxes.

The setUp function configures everything the app needs to handle the device camera; with position: .front we choose the front camera of the device.

The following line is really important:


cameraSizeRect = preview.layerRectConverted(fromMetadataOutputRect: CGRect(x: 0, y: 0, width: 1, height: 1))

The layerRectConverted function transforms normalized frame coordinates into the coordinates of the preview layer, that is, into coordinates on the screen.

To give you an idea of the whole process, let's say we are using an iPhone 12, whose 6.1-inch screen has a usable resolution of 844x390: 844 pixels available in height and 390 in width. Each frame has a size of 1920x1080, and since the app runs in portrait mode, 1920 is the height and 1080 the width. We first crop the frame to 1080x1080 so we have a square image, which we then resize to 128x128. The boxes predicted by the model are relative to the 1080x1080 image, so we first need to "uncrop" the box coordinates to recover the 840 pixels that were removed (this is something we will do in a later section). Once we have the coordinates of each box in the 1920x1080 frame, we have to translate them to the device screen, 844x390.
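
As a simplified sketch of the "uncrop" step (this is not the exact code we will use later, just an illustration of the idea), a coordinate normalized to the 1080x1080 crop can be mapped back to the full 1080x1920 frame like this:

import CoreGraphics

// Hypothetical sketch: map a point that is normalized to the centered
// 1080x1080 crop back to pixel coordinates in the full 1080x1920 frame.
func uncrop(normalizedPoint: CGPoint,
            cropSide: CGFloat = 1080,
            frameSize: CGSize = CGSize(width: 1080, height: 1920)) -> CGPoint {
    // Pixels removed from the top (and bottom) when the frame was center-cropped.
    let removedTop = (frameSize.height - cropSide) / 2   // 420 for a 1080x1920 frame
    return CGPoint(x: normalizedPoint.x * cropSide,
                   y: normalizedPoint.y * cropSide + removedTop)
}

// The center of the crop (0.5, 0.5) lands at (540, 960), the center of the
// full frame; from there it still has to be scaled to the screen coordinates.
let framePoint = uncrop(normalizedPoint: CGPoint(x: 0.5, y: 0.5))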

We already talked about how the preview we see on the screen is not the whole frame, since part of the frame has to be cut so the preview fills all the space available on the screen. To achieve this, the preview view changes its size. In the present example the final height of the preview view matches the device height, 844 pixels, but the width grows to 474, so we lose 42 pixels on each side of the frame.

In the case of an 8th generation iPad, another device where the app was tested, the camera resolution is 1280x720 and the screen size is 1080x810. Here the preview changes its size to 1440x810 so it can occupy the whole screen, which leaves 180 pixels above and below that we never see from the frame.

The layerRectConverted function lets us figure this out. We pass a CGRect that covers the whole frame: origin 0,0 and a width and height of 1, which in normalized coordinates means 100% of the frame. The output is a CGRect with the translated coordinates. In the case of the iPad the output is (1440.0, -180.0) for the height and Y origin, and (810.0000000000001, -8.817456953860943e-14) for the width and X origin. The negative number means we would need 180 extra pixels above and below to fit the whole frame on the screen, so the preview cuts 180 pixels from the top and bottom of the frame. In the case of the width the preview cuts less than a pixel, an insignificant difference.
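
A small, hypothetical debugging snippet (not part of the app) that prints these values for a configured preview layer makes this easier to see:

import AVFoundation
import CoreGraphics

// Hypothetical debug snippet: convert the whole normalized frame (origin 0,0
// and size 1x1 in metadata-output coordinates) into the preview layer's
// coordinate space and inspect the result.
func logVisibleRegion(of preview: AVCaptureVideoPreviewLayer) {
    let fullFrame = CGRect(x: 0, y: 0, width: 1, height: 1)
    let converted = preview.layerRectConverted(fromMetadataOutputRect: fullFrame)

    // A negative origin means part of the frame falls outside the screen;
    // on the iPad from the example above this prints a height of 1440 and an
    // origin.y of -180, matching the 180 cropped pixels discussed earlier.
    print("origin:", converted.origin, "size:", converted.size)
}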

You can change the:


camera.preview.videoGravity = .resizeAspectFill

value to .resizeAspect to notice the crop that the preview has to do.

The line:


var modelHandler: ModelHandlerTF? = ModelHandlerTF(modelFileInfo: (name: "face_500", extension: "tflite"))

Creates the ModelHandlerTF instance that will receive the new frames from the camera and return an array of BoxPrediction values:


@Published var boxes: [BoxPrediction] = []

and a UIImage:


@Published var myImage: UIImage = UIImage(named: "test.png")!

The captureOutput function is triggered each time we get a new frame from the camera. Here we pass each frame to the modelHandler?.runModel function, which returns the box coordinates.

Finally, in the interface we have two more views:




struct FrameView: View {
    var boxes: [BoxPrediction] = []

    var body: some View {
        ForEach(boxes, id: \.id) { box in
            Rectangle().strokeBorder(Color.red).frame(width: box.rect.width, height: box.rect.height).position(x: box.rect.midX, y: box.rect.midY)
        }
    }
}

struct CameraView: View {
    @StateObject var camera = CameraModel()

    var body: some View {
        ZStack {
            CameraPreview(camera: camera).ignoresSafeArea(.all, edges: .all)
            FrameView(boxes: camera.boxes)
            Rectangle().strokeBorder(Color.blue).frame(width: camera.maxWidth, height: camera.maxHeight)
            Text(String(camera.frames)).position(x: 100, y: 100)
            VStack {
//                Image(uiImage: camera.myImage).resizable().frame(width: 128, height: 128).position(x: 64, y: 128)
//                Image(uiImage: camera.myImage).resizable().frame(width: 240, height: 426).position(x: 405, y: 440)
            }
        }.onAppear(perform: {
            camera.Check()
            camera.CheckModel()
        })
    }
}

The CameraView is the view we will see on the screen. Here we have a reference to the CameraModel, which is a state object, so each time this object changes the view is refreshed to reflect the new information. Inside the body variable we put all the pieces we want to show on the screen, like the CameraPreview. The FrameView is a view that draws one box for each detected face; we pass it the coordinates of these boxes.

We can also draw an image; this is useful to debug what the camera perceives and what gets cropped out of the frame. We can likewise draw the final image that the model takes as input, the square image of size 128x128.

Model Handler

In a different file we have the code to load and run our model using TFLite:


class ModelHandlerTF: NSObject {
    let threadCount: Int
    let threadCountLimit = 10

    let threshold: Float = 0.75

    let batchSize = 1
    let inputChannels = 3
    let inputWidth = 128
    let inputHeight = 128
    let boxesCount = 896

    private var myInterpreter: Interpreter

    private let bgraPixel = (channels: 4, alphaComponent: 3, lastBgrComponent: 2)
    private let rgbPixelChannels = 3
    private let AMS = AverageMaximumSuppresion()

    // Load the .tflite file from the app bundle and create the interpreter.
    init?(modelFileInfo: FileInfo, threadCount: Int = 1) {
        let modelFilename = modelFileInfo.name

        guard let modelPath = Bundle.main.path(
            forResource: modelFilename,
            ofType: modelFileInfo.extension
        ) else {
            print("Failed to load the model file with name: \(modelFilename).")
            return nil
        }

        self.threadCount = threadCount
        var options = Interpreter.Options()
        options.threadCount = threadCount

        do {
            myInterpreter = try Interpreter(modelPath: modelPath, options: options)
            try myInterpreter.allocateTensors()
        } catch let error {
            print("Failed to create the interpreter with error: \(error.localizedDescription)")
            return nil
        }

        super.init()
    }

    // Resize the frame, run inference and post-process the outputs into boxes.
    func runModel(onFrame pixelBuffer: CVPixelBuffer, cameraSize: CGRect) -> ([BoxPrediction], UIImage) {
        let imageWidth = CVPixelBufferGetWidth(pixelBuffer) // 1080
        let imageHeight = CVPixelBufferGetHeight(pixelBuffer) // 1920
        let sourcePixelFormat = CVPixelBufferGetPixelFormatType(pixelBuffer)

        assert(sourcePixelFormat == kCVPixelFormatType_32ARGB ||
               sourcePixelFormat == kCVPixelFormatType_32BGRA ||
               sourcePixelFormat == kCVPixelFormatType_32RGBA)

        let imageChannels = 4
        assert(imageChannels >= inputChannels)

        let scaledSize = CGSize(width: inputWidth, height: inputHeight)

        guard let scaledPixelBuffer = pixelBuffer.resized(to: scaledSize) else {
            return ([AverageMaximumSuppresion.emptyBox], UIImage.init(named: "test.png")!)
        }

        let tensorPredictions: Tensor
        let tensorBoxes: Tensor

        var resizedCgImage: CGImage?
        // change pixelBuffer to scaledPixelBuffer to have the cropped image
        VTCreateCGImageFromCVPixelBuffer(pixelBuffer, options: nil, imageOut: &resizedCgImage)
        let resizedUiImage = UIImage(cgImage: resizedCgImage!)

        do {
            let inputTensor = try myInterpreter.input(at: 0)

            guard let rgbData = rgbDataFromBuffer(
                scaledPixelBuffer,
                byteCount: batchSize * inputWidth * inputHeight * inputChannels,
                isModelQuantized: inputTensor.dataType == .uInt8
            ) else {
                print("Failed to convert the image buffer to RGB data.")
                return ([AverageMaximumSuppresion.emptyBox], UIImage.init(named: "test.png")!)
            }

            try myInterpreter.copy(rgbData, toInputAt: 0)
            try myInterpreter.invoke()

            tensorPredictions = try myInterpreter.output(at: 0)
            tensorBoxes = try myInterpreter.output(at: 1)

            let predictions = [Float](unsafeData: tensorPredictions.data) ?? []

            let boxes = [Float](unsafeData: tensorBoxes.data) ?? []

            let arrays = getArrays(boxes: boxes)

            let finalBoxes: [BoxPrediction] = AMS.getFinalBoxes(rawXArray: arrays.xArray, rawYArray: arrays.yarray, rawWidthArray: arrays.width, rawHeightArray: arrays.height, classPredictions: predictions, imageWidth: Float(imageWidth), imageHeight: Float(imageHeight), cameraSize: cameraSize)

            return (finalBoxes, resizedUiImage)

        } catch let error {
            print("Failed to invoke the interpreter with error: \(error.localizedDescription)")
            return ([AverageMaximumSuppresion.emptyBox], UIImage.init(named: "test.png")!)
        }
    }
}

Part of this code comes from the TensorFlow Lite example apps. In the init method we prepare the TensorFlow Lite interpreter, while in the runModel method we resize and crop the frame at the center to get a square image. We can use the lines:


var resizedCgImage: CGImage?
VTCreateCGImageFromCVPixelBuffer(scaledPixelBuffer, options: nil, imageOut: &resizedCgImage)
let resizedUiImage = UIImage(cgImage: resizedCgImage!)

to transform the cropped image into a UIImage and show it in the interface, which helps to debug the app and see what the model takes as input.

This class has two more methods that you can find in the repository: rgbDataFromBuffer and getArrays. We use the former to transform a frame from the camera into RGB data the model can process; the latter transforms the output of the model into separate arrays: xArray, yarray, width and height.
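
The exact layout of the output tensor depends on how the model was exported, so the following is only a rough sketch of the kind of work getArrays does; it assumes each of the 896 boxes is stored as four consecutive floats, [x, y, width, height], which may not match the real code in the repository:

// Rough sketch only: split a flat output array into per-coordinate arrays,
// assuming four consecutive floats per box: [x, y, width, height].
func splitBoxes(_ flat: [Float], boxesCount: Int = 896)
    -> (xArray: [Float], yArray: [Float], width: [Float], height: [Float]) {
    var xArray = [Float](), yArray = [Float]()
    var width = [Float](), height = [Float]()

    for i in 0..<boxesCount {
        let base = i * 4
        xArray.append(flat[base])
        yArray.append(flat[base + 1])
        width.append(flat[base + 2])
        height.append(flat[base + 3])
    }
    return (xArray, yArray, width, height)
}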

Finally, the getFinalBoxes method of the AMS object is where we pass all the output boxes and keep only the ones that actually contain faces.
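
Conceptually, the first step of that filtering boils down to a score threshold like the sketch below, assuming the class predictions are already probabilities; the real implementation inside the AMS class may differ:

// Sketch only: keep the indices of the anchors whose face score passes the
// 0.75 threshold; the surviving boxes then go through the blending step.
func candidateIndices(classPredictions: [Float], threshold: Float = 0.75) -> [Int] {
    return classPredictions.enumerated()
        .filter { $0.element >= threshold }
        .map { $0.offset }
}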

Non-Maximum Blending

In Python we have access to several libraries for linear algebra; things like matrix multiplications are trivial thanks to libraries like NumPy. In Swift and iOS these kinds of libraries are scarce, or their documentation is poorly written. As a consequence, we have to implement every operation piece by piece and use a lot of for loops. The following code is the implementation of the Non-Maximum Blending function that we learned about in the Face Detection for low-end hardware using the BlazeFace Architecture post; I recommend reading that section to understand how it works. Here I will only show how we transform the output coordinates into the coordinates of the screen, so you should check the whole code in the repository; there are a lot of map, reduce and filter functions that compute things like intersection over union (a minimal sketch of that computation appears after the code explanation below).


let scaleY = imageWidth / imageHeight
let offsetY: Float = 0.20

var finalHeight = (finalYMax * scaleY) - (finalY * scaleY)
finalHeight = finalHeight * Float(cameraSize.height)

var y1 = (finalY * scaleY) + offsetY
y1 = (y1 * Float(cameraSize.height)) - Float(abs(cameraSize.origin.y))

var finalWidth = abs(finalXMax - 1) - abs(finalX - 1)
finalWidth = finalWidth * Float(cameraSize.width)

var xCoord = abs(finalX - 1) * Float(cameraSize.width)
xCoord = xCoord - Float(abs(cameraSize.origin.x))

var boxesRect: CGRect = CGRect.zero

boxesRect.origin.y = CGFloat(y1)
boxesRect.origin.x = CGFloat(xCoord)
boxesRect.size.height = CGFloat(finalHeight)
boxesRect.size.width = CGFloat(finalWidth)

let weightedDetection = BoxPrediction(rect: boxesRect, score: totalScore)

The scaleY variable transforms our coordinates from the cropped frame back to the full image, recovering the pixels that were removed, and offsetY is the amount we have to move the origin: since we are adding the cropped pixels back, the origin is no longer in the same position. Notice the multiplications by cameraSize.height and cameraSize.width; as we saw earlier, these are the sizes of the preview view adapted to the screen, and we also subtract the origin in X and Y. For instance, in the iPad case the preview view height is 1440, so we multiply by this value and subtract the excess, in this case the 180 pixels from the area above. This is one way to transform the output coordinates into screen coordinates; it is not the only way, and there may be better ones. We can also see that we take the absolute value of finalX - 1; this is done because the frame is mirrored, and so are its coordinates.

Finally, we create a CGRect where we save all the coordinates for one box.
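
As promised above, here is a minimal sketch of the intersection over union computation using CGRect operations; the real code in the repository may be structured differently:

import CoreGraphics

// Minimal IoU sketch between two boxes represented as CGRects.
func intersectionOverUnion(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let intersection = a.intersection(b)
    guard !intersection.isNull else { return 0 }

    let intersectionArea = intersection.width * intersection.height
    let unionArea = a.width * a.height + b.width * b.height - intersectionArea
    return unionArea > 0 ? intersectionArea / unionArea : 0
}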

Results

In this implementation we run the model on the CPU of the device. On an iPhone 12 (A14 chip) the frame rate is around 70, on an 8th generation iPad (A12 chip) it is around 50, and on a first generation iPhone SE (A9 chip) it is around 23.
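
Since the initializer of ModelHandlerTF already takes a threadCount parameter (it defaults to 1), one simple experiment is to give the interpreter more CPU threads when creating the handler; the value of 2 below is just an example and the best setting depends on the device:

// Hypothetical tweak: create the handler with more interpreter threads.
// threadCount: 2 is only an example value; measure the frame rate to find
// the best setting for each device.
var modelHandler: ModelHandlerTF? = ModelHandlerTF(
    modelFileInfo: (name: "face_500", extension: "tflite"),
    threadCount: 2
)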