Code Poetry
and Text Adventures

by catid, posted 5:21am Sat. Mar 18th 2017 PDT
I've been reading up on machine learning (ML) and computer vision (CV) lately and noticed some things that seem to be largely overlooked:

We should investigate doing ML for CV differently:

(1) We should try modelling an eyeball scanning the world rather than running recognition directly against lens-distorted camera images.  This means adding a new step to the front end of image recognition (and other) pipelines, so that we are convolving our convolutional neural networks (CNNs) against what the fovea would see rather than against weirdly distorted images.

(2) We should also try using a lower-resolution version of the image data to decide whether or not each part of the image is worth examining more deeply (with deeper neural networks), to save time.  For example, don't scan for faces in the sky.

Discussion (1):

Our eyes see low-res blurry shapes outside of the fovea (the center of vision).  The fovea is a really narrow, laser-focus region in which we see detail.  That feeds into a neural network (our brain) much like computer vision pipelines do (because CV is based on biology).  And in machine learning we are converging on convolutional NNs - networks of kernels that are applied per-pixel across an image.
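To make that concrete, here's a minimal NumPy sketch of the per-pixel kernel operation a CNN layer repeats with many learned kernels (the kernel here is a hand-picked edge detector, not a learned one):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Slide one kernel over every pixel (valid region, no padding).
    Note: like most DL frameworks, this is really cross-correlation."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            out[y, x] = np.sum(img[y:y+kh, x:x+kw] * kernel)
    return out

# A horizontal-difference kernel responds only where intensity changes:
edges = conv2d_valid(np.array([[0., 0., 1., 1.]]), np.array([[-1., 1.]]))
print(edges)  # [[0. 1. 0.]]
```

A CNN layer just learns many such kernels and applies them all at every pixel; the point here is only the per-pixel sliding-window structure.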

But images are captured with cameras that don't just see the laser focus of the fovea.  They have a lot of warping due to lens distortion.  And we're applying neural networks across these distorted images.

And every camera is different - they have different intrinsics due to settings and lens choices.
Human eyes are all the *same* - we all have the same-sized eyes and optics.

It's like our biology was designed to avoid the problem of having to deal with camera intrinsics.

So long as we're trying to imitate our own biology, why not imitate our approach to optics too?  i.e. apply a filter step before the neural network that eliminates distortion, simulating a swiveling eyeball looking directly at each pixel in the image.  This would be similar to how we predistort the image in VR HMDs.
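A minimal sketch of that filter step, assuming a pinhole camera with a single radial distortion term `k1` (a real pipeline would use a fuller distortion model and interpolated sampling): for each ideal output pixel, run it through the forward distortion model to find where it landed in the captured image, then copy that sample back.

```python
import numpy as np

def undistort_nearest(img, fx, fy, cx, cy, k1):
    """Undo simple radial lens distortion by inverse mapping: for each
    ideal (undistorted) output pixel, apply the forward distortion model
    to find its source location in the captured image, then copy that
    sample back (nearest neighbor)."""
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[0:h, 0:w]
    xn = (xs - cx) / fx          # normalized camera coordinates
    yn = (ys - cy) / fy
    r2 = xn * xn + yn * yn
    scale = 1.0 + k1 * r2        # forward radial distortion model
    src_x = np.round(xn * scale * fx + cx).astype(int)
    src_y = np.round(yn * scale * fy + cy).astype(int)
    ok = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out[ok] = img[src_y[ok], src_x[ok]]
    return out
```

With `k1 = 0` the mapping is the identity, so the image passes through unchanged; with a real lens's `k1` the resampled image is what a swiveled eyeball pointed at each pixel would see.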

We would have to know the camera intrinsics to do that, but throw a printed checkerboard pattern at it and we're there.

Maybe we can learn the intrinsics of a camera by adjusting its parameters via gradient descent, using the success of ML classifiers to guide the optimization?
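A toy version of that idea in NumPy, with some loud assumptions: the distortion model is a single radial term `k1`, and a simple geometric error stands in for "classifier success" (a real system would backpropagate the classifier's loss instead):

```python
import numpy as np

rng = np.random.default_rng(0)
true_k1 = -0.2  # the unknown lens parameter we want to recover

# Ideal normalized image points, and what the lens does to them:
xn = rng.uniform(-0.5, 0.5, 100)
yn = rng.uniform(-0.5, 0.5, 100)
r2 = xn * xn + yn * yn
xd = xn * (1.0 + true_k1 * r2)   # observed, distorted coordinates
yd = yn * (1.0 + true_k1 * r2)

def loss(k1):
    # Stand-in for "classifier success": undistort the observations with
    # our current guess and score how close we get to the ideal points
    # (using the known r2 here to keep the toy simple).
    xu = xd / (1.0 + k1 * r2)
    yu = yd / (1.0 + k1 * r2)
    return np.mean((xu - xn) ** 2 + (yu - yn) ** 2)

k1, lr, eps = 0.0, 10.0, 1e-6
for _ in range(200):
    grad = (loss(k1 + eps) - loss(k1 - eps)) / (2.0 * eps)  # finite diff
    k1 -= lr * grad
# k1 has now descended to roughly true_k1 (-0.2)
```

In the real version the "loss" would be the classification error of a downstream network, and the distortion parameters would just be a few more weights for the optimizer to tune.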

It would be nice to have a general solution to this that doesn't need manual calibration because I see that everyone is just cropping out the edges of images before processing them =)

Discussion (2):

Also our eyes don't perceive all of reality at once - We focus on areas of interest.

Don't run the neural network on every single thing in the image - Use a lower resolution version of the image to guide where it should look.

That makes more sense applied to video than to single images; when we look at photos we look at everything too.  But when watching video we ignore a lot, e.g. the invisible gorilla experiment.
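A sketch of the cheap first pass, under the assumption that low local variance is a good-enough "nothing here" signal (a real system might use a small saliency network instead):

```python
import numpy as np

def blocks_worth_a_look(img, block=8, thresh=10.0):
    """Cheap first pass over a coarse grid: only blocks with enough
    local variation get flagged for the expensive deep network.
    Nearly-flat regions (sky, blank walls) are skipped entirely."""
    flags = []
    h, w = img.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            if img[y:y+block, x:x+block].std() > thresh:
                flags.append((y, x))
    return flags

# Flat "sky" on top, textured checkerboard "ground" below:
img = np.zeros((16, 16))
img[8:, :] = (np.indices((8, 16)).sum(axis=0) % 2) * 255.0
print(blocks_worth_a_look(img))  # [(8, 0), (8, 8)]
```

Only the two textured blocks get flagged; the deep network never even sees the sky, which is the whole point.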

A friend found a project that is starting to look in this direction:
last edited by catid 12:20pm Sat. Mar 18th 2017 PDT