RealTime AI Camera iPhone app banner

This Is What a Robot Can See Now — 601 Objects, Live, Offline, on Your iPhone

Hold up a banana. Your phone says “banana.” Hold up a ukulele. It says “ukulele.” A stapler, a french horn, a goose, a CT scanner, a waffle iron — it names them all, live, at 10 frames per second, with the internet turned off.

That’s the part I keep coming back to. We’ve quietly crossed a line where a machine running on the 6-ounce thing in your pocket can recognize 601 different objects in the world around it without phoning anywhere. No cloud. No account. No waiting. That’s an extraordinary amount of sight to hand to a piece of consumer hardware, and it’s available right now, for free.

I built RealTime AI Camera to show people what that actually feels like. The app is on the App Store, it’s free, and the source is on GitHub. Point it at your kitchen and watch it label everything in real time. Most people don’t realize how far the on-device models have come until they see it happening in their own hand.

Here’s why it’s a bigger deal than it sounds — and what it took to actually build it.

80 things vs 601 things

Pretty much every iPhone object detection app you’ve ever used recognizes the same 80 things. That’s the default COCO class set that ships with YOLO — person, car, dog, cup, laptop, banana, the obvious ones. For years that’s been the practical ceiling on-device. If you wanted your app to detect anything beyond that, you sent the frame to a server, which meant goodbye offline, goodbye privacy.

RealTime AI Camera uses the much bigger Open Images V7 class set: the original 80 plus 521 more. Musical instruments by type. Animals you’ve never heard of. Kitchen appliances, tools, clothing specifics, scientific instruments, transportation, furniture subtypes. A model that understands the shape of “french horn” distinct from “tuba,” “goose” distinct from “duck,” “stapler” distinct from “hole punch” — all running on your phone, all offline.

The fact that you can walk around a house and have a device narrate the world back to you at 10 FPS without a network connection is a genuinely new thing. I keep meeting people who assume any “AI” on their phone must be sending data somewhere. Watching them flip airplane mode on and see the app keep working is sometimes the first moment they actually believe on-device AI is real.

What the app actually looks like

Screenshots from the shipping app on iPhone — object detection, OCR, offline translation, LiDAR depth overlays.

RealTime AI Camera home screen
RealTime AI Camera detecting objects live
RealTime AI Camera bounding boxes over 601 classes
RealTime AI Camera LiDAR depth overlay
RealTime AI Camera OCR or translation mode

Why 80 → 601 isn’t just “more classes”

Going from 80 classes to 601 is not a linear problem. The model has to learn all of it, and then it has to fit on an iPhone, and then it has to run fast enough to feel real-time.

Every one of those requirements individually is solvable. Stacking them together is where it gets mean.

The PyTorch → CoreML conversion

The weights started life in PyTorch. Apple doesn’t run PyTorch natively — everything on-device goes through CoreML, which means a conversion step. For an 80-class YOLO, that conversion is mostly automatic. For a 601-class YOLOv8 with Open Images V7 weights, I spent days on it. Some ops translate cleanly, some fall into “custom layer” territory, and some conversions silently produce a model that runs but outputs garbage. The first few versions I converted ran at full FPS and detected nothing accurately — the tensors were all being reshaped wrong at inference time.
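One cheap defense against the “runs but outputs garbage” failure mode is checking the raw output tensor layout right after conversion. Here’s a minimal sketch in pure Python, assuming the standard YOLOv8 export convention for a 640×640 input — one output of shape (1, 4 + num_classes, 8400), where 8400 is the anchor count across strides 8, 16, and 32. The function names are mine, not from any library.

```python
# Sanity check for a converted YOLOv8 head (illustrative, not the app's code).
# Assumes the standard 640x640 YOLOv8 export layout:
# raw output shape (1, 4 + num_classes, anchors), where
# anchors = 80*80 + 40*40 + 20*20 = 8400 across the three strides.

def expected_yolov8_output_shape(num_classes: int, img_size: int = 640) -> tuple:
    """Compute the raw output shape a YOLOv8 export should have."""
    # One prediction per grid cell at strides 8, 16, and 32.
    anchors = sum((img_size // s) ** 2 for s in (8, 16, 32))
    return (1, 4 + num_classes, anchors)

def check_converted_output(shape: tuple, num_classes: int) -> bool:
    """True if the converted model's output tensor matches the expected layout."""
    return tuple(shape) == expected_yolov8_output_shape(num_classes)
```

For 601 classes this expects (1, 605, 8400); a transposed (1, 8400, 605) tensor is exactly the kind of silent reshape bug that decodes into nonsense boxes while still running at full speed.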

The published model is on HuggingFace: divinetribe/yolov8n-oiv7-coreml. If you want to skip the conversion pain and just use the model, that’s the shortcut I wish I’d had.

Hallucination at scale

When your class count is 7.5× bigger, you don’t just get 7.5× more possible correct detections — you get many multiples more wrong ones. The model starts seeing “musical instrument” in every shadow, “building” in every wall, “person” in every coat on a chair. At 80 classes the false-positive rate is manageable because the class boundaries are mostly clean. At 601, every low-confidence detection is a roll of the dice across a much bigger space of wrong answers.
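A back-of-envelope model makes the scaling concrete. If each wrong class independently clears the confidence threshold on a background region with some small probability p (a deliberate simplification — real class scores aren’t independent), the chance that some class fires grows like 1 − (1 − p)^C with class count C:

```python
# Back-of-envelope illustration (not the app's measured statistics):
# if each wrong class independently clears the threshold with
# probability p, the chance that *some* class fires on a background
# region is 1 - (1 - p)^C for C classes.

def p_any_false_positive(p_per_class: float, num_classes: int) -> float:
    return 1.0 - (1.0 - p_per_class) ** num_classes

# With a hypothetical 0.1% per-class noise rate per region:
p80 = p_any_false_positive(0.001, 80)    # roughly 8%
p601 = p_any_false_positive(0.001, 601)  # roughly 45%
```

Same per-class noise, several times the chance of a spurious box somewhere on screen — which is why the thresholds that worked at 80 classes fall apart at 601.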

Fixing this took iteration on three knobs: confidence threshold, NMS (non-max suppression) threshold, and per-class filtering. Some classes in Open Images V7 are just noisy — the ground truth in the training set was fuzzy to begin with. For the app, I tuned the thresholds conservatively enough that the user sees confident detections and not a constant light show of wrong guesses. That conservatism costs some recall, but the experience is night and day better than “technically detecting 601 things.”
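The first two knobs compose into one filtering pass. Here’s a pure-Python sketch of that pass — a confidence gate, a blocklist for known-noisy classes, and greedy per-class NMS. The box format and the threshold values (0.45 / 0.5) are illustrative placeholders, not the shipped app’s tuning.

```python
# Minimal sketch of the detection filters (illustrative thresholds,
# not the app's shipped values).
# Box format assumed: (x1, y1, x2, y2, score, class_id).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_detections(dets, conf_thresh=0.45, nms_thresh=0.5, blocked=frozenset()):
    """Confidence gate, per-class blocklist, then greedy per-class NMS."""
    # Drop low-confidence hits and known-noisy classes first.
    dets = [d for d in dets if d[4] >= conf_thresh and d[5] not in blocked]
    kept = []
    # Highest confidence first; suppress overlapping same-class boxes.
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(k[5] != d[5] or iou(k[:4], d[:4]) < nms_thresh for k in kept):
            kept.append(d)
    return kept
```

The per-class blocklist is the blunt instrument for Open Images V7 classes whose training labels were too fuzzy to ever fire cleanly; raising the confidence threshold handles the rest.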

Screens across multiple iPhones

This part I didn’t expect to be hard. SwiftUI is supposed to make responsive layout easy. In practice, running a live camera feed with overlaid bounding boxes + OCR text + LiDAR distance badges + a control strip, across iPhone SE through iPhone 17 Pro Max, is a real problem. Aspect ratios differ. The notch and Dynamic Island move. LiDAR is only on Pro models so the UI needs to gracefully degrade when it’s missing. Older iPhones can’t sustain 10 FPS on the bigger 601-class model so the app has to throttle.

Most of the hours on the “done except for…” list at the end of the project were UI hours, not ML hours. Nobody tells you that until you’ve lived it.

The bottleneck

The performance bottleneck on iPhone is not the Neural Engine — it’s memory bandwidth between the camera pipeline and the inference step. CoreML + Metal Performance Shaders + the Neural Engine are all fast. Moving pixel buffers from AVCaptureSession into a format the model wants, at 30+ FPS without dropping frames, is where you lose time. I ended up doing a lot of zero-copy plumbing — reusing the CVPixelBuffer that the camera hands you, avoiding intermediate UIImage conversions, keeping everything on-GPU through the pipeline. Average 10 FPS on the 601-class model across the iPhone lineup came from that plumbing, not from model optimization.
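One pattern that matters here is “latest frame wins”: the inference thread must never queue up stale frames behind a slow model. The shipping app does this on the Swift side (AVCaptureSession’s video output can discard late frames); below is a language-agnostic sketch of the idea in Python, not the app’s actual code.

```python
import threading

# "Latest frame wins" pattern: the camera thread overwrites, the
# inference thread drains. A slow consumer never builds a backlog of
# stale frames. Illustration only -- the real app does this in Swift
# inside the AVCaptureSession pipeline.

class LatestFrameSlot:
    """Holds at most one pending frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None

    def publish(self, frame):
        # Camera thread: replace whatever is pending, never block on inference.
        with self._lock:
            self._frame = frame

    def take(self):
        # Inference thread: grab and clear the pending frame, if any.
        with self._lock:
            frame, self._frame = self._frame, None
            return frame
```

With this shape, a model that runs at 10 FPS against a 30 FPS camera simply skips two of every three frames instead of drifting seconds behind reality.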

The shipped app has four features that all run in that same pipeline:

  1. Object detection — YOLOv8, 601 classes, Open Images V7
  2. English OCR — on-device printed text recognition (Vision framework)
  3. Spanish → English translation — offline, rule-based + dictionary (no cloud translation)
  4. LiDAR distance — per-object depth, on Pro models

All of it on one device, no network, no account, no ads.
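For a sense of what the rule-based + dictionary approach in item 3 means in practice, here’s a toy word-for-word sketch. The three-entry vocabulary is invented for illustration; the app’s offline dictionary and rules are far larger, and this is the spirit of the approach, not its implementation.

```python
# Toy word-for-word Spanish -> English lookup, in the spirit of the
# app's offline translation mode. Vocabulary invented for illustration.

TOY_DICT = {"hola": "hello", "mundo": "world", "gato": "cat"}

def translate(text: str) -> str:
    """Lower-cased word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(TOY_DICT.get(w.lower(), w) for w in text.split())
```

The trade-off is the obvious one: no cloud round-trip and no network dependency, at the cost of fluency — which is exactly the right trade for glanceable camera-overlay text.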

Why this matters beyond the app

On-device AI is having a moment. Everybody’s talking about it, but most shipped products still cheat by falling back to the cloud when the model isn’t up to the job. RealTime AI Camera is a small proof that you can actually ship a nontrivial AI experience that never leaves the device, for free, and have it work.

This is the same thesis behind my other big open-source project, claude-code-local — running Anthropic’s Claude Code CLI against a local AI model on a MacBook, zero cloud calls, full coding experience. Different target (MacBook vs iPhone), different model sizes (31-122 billion params vs ~10 million), same philosophy: your data doesn’t need to leave the machine for the AI to be useful.

If you’re building in this space and thinking about on-device-first, I’d love to hear what you’re running into. Open an issue on the RealTimeAICam repo, or drop me a line.

Try it

Free on the App Store: RealTime AI Cam

Source on GitHub: github.com/nicedreamzapp/RealTimeAICam

Model on HuggingFace: divinetribe/yolov8n-oiv7-coreml

— Matt


Part of the Nice Dreamz Apps lineup — private, on-device AI tools. If you want this kind of thing set up inside a firm or practice, that’s AirGap AI.
