Machine Vision

Bob Seitz

Statement of the Problem:

Machine vision is a daunting challenge for roboticists. It has been estimated that computational speeds of the order of billions of operations per second are required to match the visual capabilities of the human eye. Even more daunting is the capacity of the human mind for visual recognition and understanding of what the eye can see.

In the animal kingdom, vision, working in concert with the other senses, would seem to satisfy at least two needs, with different computational demands.

(1) Navigation and obstacle avoidance in its environment. This presumably includes the construction of a crude, internally-stored 3-D terrain model.

(2) Recognition of objects in its environment. These include:

(a) Detection, recognition, and avoidance of predators.

(b) Detection, recognition, and pursuit of food.

(c) Awareness and identification of other objects in its environment, such as other animals in a herd.

Capabilities of Human Vision:

Since hundreds of millions of years of product improvement have been invested in the animal eye and millions of years of such fine-tuning have been expended on the human eye, it seems reasonable to assume that it is highly optimized for survival. Consequently, we can probably learn a lot from its design.

The retina of the human eye harbors about 100,000,000 photo-receptors. Of these, about 94,000,000 are rods, used for night vision, and about 6,000,000 are red, green, and blue (RGB) day-vision cones. For some good reason, only about 7% or 420,000 of these sensors are sensitive to blue light. This would seem to imply the presence of at least 2,800,000 red and 2,800,000 green day-vision pixels.

The foveal region of the ideal human eye can resolve down to the Dawes limit and spans a 2° angular "vision cone". Assuming a daytime pupillary diameter of 1/10th inch, the Dawes limit for the eye is 40" of arc for green light (5,000Å) or about 1 part in 5,000 (1 foot at a distance of 1 mile). However, resolution falls off linearly on either side to only about 45 arc-minutes or 3/4ths of a degree at the ±90° extremes. Perhaps nature has designed us this way because higher-resolution peripheral vision isn't essential to survival even for a primate in the wild, and this strategy greatly reduces the neural resources necessary for vision. This kind of resolution could readily be provided by using two video cameras, one for high-resolution central vision and one for wide-field peripheral vision.

About 6,000,000,000 neurons in the brain are dedicated to visual processing or about 180 neurons for every color-triplet pixel. As noted in the Machine Intelligence plan, they each interconnect with something like 15,000 other neurons, a formidable processing array, indeed! Since only 6,000,000 or 6% of the eye's receptors are dedicated to color day vision, we might assume that only 6% of the visual cortex's 6,000,000,000 neurons or 360,000,000 neurons are dedicated to day vision. (Of course, image processing is image processing, and it may be that most of the circuitry in the visual cortex is at work whether it's processing daytime imagery or nighttime imagery. However, for the moment, we'll assume that separate circuitry processes the two types of input.)

The eye's daytime update rate is 20 frames per second and the visual cortex fires at 40 frames per second. This would seem to imply 14.4 billion firings per second for the 360,000,000 neurons assumed to be associated with daytime vision. Since each neuron interconnects with about 15,000 other neurons, and each synapse is shared by two neurons, there are about 7,500 synapses per neuron across the 6 billion neurons for a total of about 45 trillion synapses, with 2.7 trillion of them dedicated to daytime vision. If these all fire 40 times a second, the connection update rate for the 360,000,000 neurons in the daytime cortex would be about 108,000 Gcups (giga-connection-updates-per-second) or about 10,000 times faster than our neural networks can presently muster. Of course, there may be a high level of redundancy to accommodate the inherent unreliability of biological networks. Also, a lot happens in the various levels of the visual cortex including curve fitting, feature extraction, object recognition and the issuing of orders to the motor areas.
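
As a rough check on the arithmetic above, the chain of estimates can be written out in a few lines of Python. This is only a sketch: every input is one of the estimates quoted in the text, not a measured value, and the variable names are invented for illustration.

```python
# Back-of-envelope check of the day-vision estimates quoted in the text.
# All inputs are the essay's own assumptions, not measurements.
VISUAL_NEURONS = 6_000_000_000   # neurons dedicated to visual processing
DAY_PERCENT = 6                  # share assumed to serve color day vision
SYNAPSES_PER_NEURON = 7_500      # 15,000 connections, each shared by two neurons
FIRING_RATE_HZ = 40              # assumed cortical update rate

day_neurons = VISUAL_NEURONS * DAY_PERCENT // 100
firings_per_second = day_neurons * FIRING_RATE_HZ
total_synapses = VISUAL_NEURONS * SYNAPSES_PER_NEURON
day_synapses = total_synapses * DAY_PERCENT // 100
gcups = day_synapses * FIRING_RATE_HZ // 10**9   # giga-connection-updates/second

print(f"day-vision neurons  : {day_neurons:,}")
print(f"firings per second  : {firings_per_second:,}")
print(f"total synapses      : {total_synapses:,}")
print(f"day-vision synapses : {day_synapses:,}")
print(f"connection updates  : {gcups:,} Gcups")
```

Changing any one assumption (for example, letting the whole visual cortex serve day vision) scales the final figure proportionally, which is why the estimate should be read as an order of magnitude only.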

The update rate for nighttime vision is 5 frames a second, perhaps to ease the computational burden of processing information from 94,000,000 photoreceptors. (Also, if my speculations are correct, it is more difficult to recognize objects in black-and-white imagery than in color imagery.)

Dr. Hans Moravec has estimated that it would require about a billion computations per second to match the Marr-Hildreth filtering, edge, and corner detection that takes place in the first four layers of the human retina.

The eye aggregates the inputs from sets of its photoreceptors, sending an input to the brain only when it sees a bright spot silhouetted against a dark background or a dark spot displayed against a bright background. This reduces the level of detail of the signals transmitted down the optic nerve by a factor of about 100. As a result, there are about 1,000,000 optic nerve fibers carrying the eye's signal to the brain from the 100,000,000 photoreceptors. If about 6% of these fibers are devoted to daytime vision, there would only be about 60,000 day-vision signals sent to the brain from each eye. In the visual cortex, the dot images are replicated (to produce 18 or more copies) and sent in parallel to various areas of the visual cortex. "Simple" neurons look for vertical, horizontal, and slanted (at 10° angular increments) line segments in very small areas of an image. From here, the results of all of the 18 angular sets of simple neurons (from 0° to 180°) are transmitted to "complex" neurons which detect slanted line segments anywhere in the image. These in turn transmit their results to hypercomplex neurons designed to detect corners, which form the highest level in Nature's visual-cortical hierarchy.

Nature, in order to implement vision with slow circuitry, must process visual data with a high degree of parallelism. Silicon systems, by contrast, should be able to process visual data with less than 1/1,000,000th of the parallelism of natural systems. In fact, one or a few very fast processors may be sufficient. When we factor in the reliability of silicon and the observation that machine vision systems don't need to be able to detect camouflaged predators against a masking screen of foliage, it may be possible to effect machine vision with significantly less computing power than is required for animal vision.

Simplifying the Problem:

There are a number of ways in which it may be possible to simplify the robotic vision and recognition process.

(1) Reducing Resolution

The perfect human eye has a resolution of, perhaps, 30 to 40 arc-seconds.

If we were able to afford two 2-megapixel cameras for each eye, one for high resolution and one for low resolution, we could provide 40 arc-second resolution over an 18° field of view, surrounded by 6 arc-minute resolution over a 160° horizontal field of view and a 120° vertical field of view. (If the cameras had circular CCD sensor arrays, one could achieve a 160° circular field of view rivaling that of the human eye.) However, we would need four such cameras, two for each eye, and their present-day cost might be prohibitive for amateur or production use. Also, such a strategy would require processing 8,000,000 pixels for the four cameras 20 times a second for a total of 160,000,000 pixel computations a second.

However, most human eyes aren't perfect. If we are willing to relax our resolution requirements to 60 arc-seconds, we can reduce both the image processing demands and the costs, sizes, and moments of inertia of our color cameras. One consideration which helps to establish our choices has to do with available video cameras. Small (1.5-inch cube), inexpensive ($400) color video cameras are available with resolutions of about 750 × 550 pixels (410,000 pixels). Two of these cameras, one a high-resolution camera with a 24-mm., 12° field-of-view lens, and the other a lower-resolution sensor with a 4-mm., 72° field-of-view lens, would form a matching pair. The central camera would support 1 arc-minute resolution (about 1 part in 3,440) which would be computationally trimmed to 6 arc-minute resolution (1 part in 570) at the 6° angular edge of its cone-of-vision. The wide-angle camera would then pick up the rest of the image seamlessly at its 6 arc-minute resolution which would be computationally reduced, by ignoring pixels, to 36 arc-minute resolution (1 part in 95) at the 36° limit of its angular field-of-view. Not only are these small cameras more affordable but in addition, like the human eye, they have low moments of inertia and would seem to be well-suited to high-speed pan and tilt.
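
The angular figures for the two-camera pair can be checked with a small-angle conversion. The sketch below assumes the 750-pixel dimension spans the full lens field of view and that "1 part in N" means one unit of width at N units of distance; the function names are invented for illustration.

```python
import math

ARCMIN_PER_RADIAN = 180 * 60 / math.pi   # about 3437.75 arc-minutes per radian

def parts_in(resolution_arcmin):
    """Return N such that the angular resolution equals '1 part in N'."""
    return ARCMIN_PER_RADIAN / resolution_arcmin

def pixel_scale_arcmin(fov_deg, pixels_across):
    """Angle subtended by one pixel, assuming pixels span the field evenly."""
    return fov_deg * 60 / pixels_across

for res in (1, 6, 36):
    print(f"{res:>2} arc-min is about 1 part in {parts_in(res):,.0f}")

print(f"central camera : {pixel_scale_arcmin(12, 750):.2f} arc-min per pixel")
print(f"wide camera    : {pixel_scale_arcmin(72, 750):.2f} arc-min per pixel")
```

Under these assumptions both cameras come out slightly finer than their nominal 1 and 6 arc-minute figures, which leaves a little margin for lens imperfections.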

We will assume the use of four 410,000-pixel digital color cameras, two for each eye.

(2) Reducing the Angular Range of Vision

Low-resolution peripheral vision is necessary in the animal kingdom to reduce the chances of a predator sneaking up on us. Since a first-generation machine intelligence isn't going to have to survive in the wild, its peripheral-vision capabilities, as well as its ability to detect motion, can probably be relaxed beyond those of the animal kingdom. While this would not reduce the required camera resolution, it cuts the image-analysis computational workload which, right now, is a formidable constraint in robotics and machine vision. A 60° to 120° angular field of view may be quite sufficient for a first-generation robot.

Resolution degradation, which is the rule in the animal kingdom as we go from the center to the edge of the field of view, should also be acceptable in early robots. Like the human eye, the computer might process at a maximum resolution over a ±1° foveal region. The computer could drop to half that resolution over the next conical degree by averaging over 4 pixels, then to a third that resolution over the 3rd conical degree by averaging over 9 pixels, and so on, until at 37.5°, resolution would have fallen to 37.5' or 5/8ths degrees of arc (2/75ths that of the foveal circle). In the ±1° foveal circle of most acute vision, with 60 arc-seconds of resolution, the pixel density would be 3,600 pixels/square degree. Using a circular processing pattern would imply 11,300 pixels in the ±1° foveal circle. This strategy leads to a cumulative pixel count of about 92,000 pixels for the whole 75° vision cone or a little more than 10% of the two cameras' 820,000 pixels. Other, more-optimized strategies might be essayed if image processing workloads are still too demanding. Restricting the central cone of acute vision to ±40' and reducing foveal resolution to 80" of arc would lower processing requirements by another factor of 4 or to about 23,000 pixels. Of course, image processing speeds are rapidly increasing and may not be a constraint two years from now.

My calculations for the human eye lead to a total pixel count of about 450,000 pixels per eye, rather than the nearly 3,000,000 that are observed. (This is not a comfortable conclusion. Nature is very parsimonious with her resources. If it takes Mother Nature 3,000,000 pixels to provide human vision, it seems most unlikely that yours-truly will do the same job with 450,000 pixels.)
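
The ring-by-ring pixel budget described above can be sketched as a short sum. This is one possible reading of the scheme, not necessarily the calculation behind the quoted totals: it assumes one-degree-wide circular rings, a flat (non-spherical) field, and a linear resolution that falls as 1/n in the n-th ring, so that pixel density falls as 1/n². The function name is invented for illustration.

```python
import math

def foveated_pixel_count(half_angle_deg, foveal_res_arcmin=1.0):
    """Sum pixels over 1-degree-wide rings out to half_angle_deg.

    Ring 1 is the +/-1 degree foveal circle at full resolution; in ring n
    the image is averaged over n*n pixels, so density drops by n**2.
    """
    foveal_density = (60.0 / foveal_res_arcmin) ** 2   # pixels per square degree
    total = 0.0
    inner = 0.0
    n = 1
    while inner < half_angle_deg:
        outer = min(float(n), half_angle_deg)           # last ring may be partial
        ring_area = math.pi * (outer ** 2 - inner ** 2)  # flat-field approximation
        total += ring_area * foveal_density / n ** 2
        inner = outer
        n += 1
    return total

fovea = foveated_pixel_count(1.0)
cone = foveated_pixel_count(37.5)
print(f"foveal circle : {fovea:,.0f} pixels")
print(f"75-degree cone: {cone:,.0f} pixels")
```

Under these particular assumptions the foveal circle holds roughly 11,300 pixels and the whole cone comes out in the high tens of thousands, the same order as the figure quoted in the text. The exact total is sensitive to how the rings are drawn, so the sketch is best used to compare strategies rather than to fix a number.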

(3) Relaxing Motion-Detection Update Rates

Motion/change detection is also critical for the detection of predators and prey. Again, this requirement might be relaxed for a first-generation robot, so that, on average, it checks for motion perhaps five times a second rather than 30 times a second. Thus, we might process 460,000 pixels per second per eye (92,000 pixels five times a second) or, with the reduced 23,000-pixel budget, 115,000 pixels per second per eye. By contrast, the human eye must process at least 450,000 pixels 20 times a second for a total of 9,000,000 pixels per second per eye or, at the higher figure of 2,800,000 pixels, 56,000,000 pixels per second per eye, for a total of about 112,000,000 pixels per second for both eyes. No wonder there are 360,000,000 neurons in the day-vision visual cortex!

A more advanced updating strategy might be to update the region of central vision more often than other areas, since things which are moving toward us will tend to lie in this zone and will tend to expand rather than translate. This region is important because it lies dead ahead. Also, objects entering our field of view will tend to do so at the periphery of our vision. Here, the probability of change is high. We might also want to update interleaved pixels at even lower resolutions than the already-reduced peripheral resolutions to optimize our chances of detecting changes. For example, we might check one out of every six pixels every 30th of a second, and then another neighboring one-out-of-six pixels in the next camera frame, so that over the course of a fifth of a second, we will have checked all the pixels that we need to process while, at the same time, raising our probability of picking up a change.
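
The one-in-six interleaving idea might look something like the following sketch. The diagonal pattern, `(x + y) % stride`, is an assumption made here so that each frame's samples are spread across the image rather than falling in fixed columns; the text does not specify a pattern, and the function name is invented.

```python
def interleaved_pixels(width, height, frame_index, stride=6):
    """Yield the (x, y) pixels to examine in the given frame.

    Each frame visits one pixel in every `stride`; after `stride`
    consecutive frames every pixel has been visited exactly once.
    """
    phase = frame_index % stride
    for y in range(height):
        for x in range(width):
            if (x + y) % stride == phase:   # assumed diagonal interleave
                yield (x, y)

# Usage: six consecutive frames (1/5th of a second at 30 frames a second)
width, height = 12, 6
seen = {}
for frame in range(6):
    for pixel in interleaved_pixels(width, height, frame):
        seen[pixel] = seen.get(pixel, 0) + 1

print(len(seen), "pixels visited;", max(seen.values()), "visit(s) each")
```

Each frame costs one sixth of a full scan, yet a change anywhere in the image is still seen within six frames.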

Rapidly-translating objects would pre-empt other visual-processing operations and would be examined with the maximum resolution and update rates which the cameras can provide. However, wind movements would have to be identified as such and ignored. We are interested only in movements that translate across our field of view over time.

The computational workload for our machine vision system should be no more than 1/20th that of the human eye, which might bring it down to the level of reasonable present-day signal processing capabilities—hundreds of millions or perhaps billions of calculations per second.

(4) The Fundamental Theorem: Taking Advantage of Prior Knowledge

This is so important, it deserves a whole chapter.

(a) We usually know where we are at all times.

(b) Although the human eye and brain are wonderfully capable and can handily recognize objects in unfamiliar scenes, including those seen in two-dimensional photographs, we normally have already recognized all of the objects in our environment. Furthermore, we carry in our heads a 3-D "shell model" map of familiar surroundings. We typically move only an inch or two in the 1/30th of a second between successive video frames. Consequently, knowing our location at some initial time t0 and knowing the locations of landmarks in our surroundings, we can:

(a) visually navigate from point to point, updating our position by triangulating on landmarks, updating our 3-D map using our location (a bootstrapping operation) and

(b) check the objects within our environment only for object motion or for unexpected changes.

This should greatly reduce the computational workload by obviating the need to recognize everything in the field of view every 1/30th second or even to carefully recognize most objects at all.

Registration of images should be simple. At three feet a second, we will move slightly more than an inch from one frame to the next. We will know our prior location and velocity from the last frame and can predict our current location and camera orientation in the current frame 1/30th of a second later. Knowing our location, we can calculate the expected 2-D projections of all the 3-D objects and their apparent cross-sections, scales, rotations, and translations. Since we will have moved so little, we can probably use linear extrapolations rather than detailed calculations to produce these estimates of their 2-D image-plane coordinates and cross-sections. Then we can match up these predicted cross-sections and locations with the current, observed image to identify and register the colored blobs in the observed image. Next, we can correct our previous estimate from the observed edges and cross-sections in the current image in order to make updated projections for the next image. This way, we'll be using Nature to provide the exact calculations of the apparent cross-sections of objects when viewed from slowly-changing locations.
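
The predict-then-correct cycle described above can be illustrated with a deliberately simplified sketch. It assumes constant velocity, a camera looking straight along the direction of travel, and an ideal pinhole projection with no rotation; the landmark position, the focal length in pixels, and all function names are hypothetical values chosen for illustration.

```python
def dead_reckon(position, velocity, dt):
    """Predict the next camera position assuming constant velocity."""
    return tuple(p + v * dt for p, v in zip(position, velocity))

def project(point, camera_position, focal_length_px):
    """Pinhole projection for a camera looking along +z (no rotation)."""
    x, y, z = (p - c for p, c in zip(point, camera_position))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (focal_length_px * x / z, focal_length_px * y / z)

def extrapolate(prev_uv, curr_uv):
    """Linearly extrapolate an image-plane coordinate one frame ahead."""
    return tuple(2 * c - p for p, c in zip(prev_uv, curr_uv))

DT = 1.0 / 30.0                      # one video frame
velocity = (0.0, 0.0, 36.0)          # 3 feet a second, in inches per second
landmark = (24.0, 0.0, 240.0)        # hypothetical: 20 ft ahead, 2 ft to the right
FOCAL_PX = 3438.0                    # hypothetical: about 1 arc-minute per pixel

p0 = (0.0, 0.0, 0.0)
p1 = dead_reckon(p0, velocity, DT)
p2 = dead_reckon(p1, velocity, DT)

uv0 = project(landmark, p0, FOCAL_PX)
uv1 = project(landmark, p1, FOCAL_PX)
predicted = extrapolate(uv0, uv1)            # cheap linear guess for frame 2
exact = project(landmark, p2, FOCAL_PX)      # full calculation for comparison
error_px = abs(predicted[0] - exact[0])

print(f"moved {p1[2]:.2f} inches per frame")
print(f"linear extrapolation error: {error_px:.4f} pixels")
```

For this geometry the linear guess lands well within a pixel of the full projection, which is the sense in which the observed image, rather than a detailed calculation, can supply the correction for the next frame.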

When changes occur very slowly, we might even get away with no extrapolations whatsoever, but might simply use the prior image to register and validate the slightly-altered object projections in the current image, and then to morph the prior image into a new, slightly-different updated image to compare with the next frame.

When the robot enters an unfamiliar scene, it should be acceptable to take a few seconds to recognize the most urgent information in that scene, and a few minutes to identify additional features within the scene. After all, this is what we do.

(5) Video Camera Problems

The Need for Wide band Video Cameras

Dealing with Camera Noise:

If video camera noise is a (hardware) problem, several solutions suggest themselves.

• We might over-sample, using higher-resolution cameras and then a majority vote among three or more pixels to determine whether a given anomaly is valid or is noise. (Could this be why the human eye appears to have several times as many pixels as we actually observe?)

• We might stop the robot long enough to take several identical frames and then use a majority vote.

• We might compare successive frames pixel by pixel after morphing them to allow for corrections arising from motion.

• The planned policy of predicting the expected scene based on dead reckoning before the robot sees it could allow us to compare the last and prior frames with the current frame for noise reduction.

• If determining the cameras' position is a problem, we can use differential GPS to locate the cameras to within a couple of inches.

• There ought to be low-noise cameras on the market to facilitate what we're trying to do. It might be desirable to use two of the new digital camcorder cameras, although these presently cost several thousand dollars.

• Cooling of the camera chips might help. If the thermal noise figure is influenced by semiconductor band gaps or some other kind of threshold phenomenon, a Peltier cooling technique might work. (These are available from Service Merchandise in the form of automotive coolers.) Otherwise, liquid nitrogen might be required (not a practical solution in the long-term).
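
The multi-frame majority vote suggested in the list above might be sketched as follows. A per-pixel median over an odd number of registered frames is used here as a stand-in for the vote; that substitution, the list-of-rows image format, and the function name are assumptions of this example.

```python
from statistics import median

def vote_frames(frames):
    """Combine an odd number of registered, equal-sized frames.

    Each output pixel is the median of that pixel across the frames, so a
    value that appears in only a minority of frames is discarded as noise.
    """
    if len(frames) % 2 == 0:
        raise ValueError("use an odd number of frames so the vote cannot tie")
    rows = len(frames[0])
    cols = len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]

# Usage: one of three frames carries a noise spike
clean = [[10, 10], [10, 10]]
noisy = [[10, 255], [10, 10]]
voted = vote_frames([clean, noisy, clean])
print(voted)
```

The same function serves the spatial over-sampling case if the "frames" are neighboring sub-pixels from a higher-resolution camera rather than successive exposures.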


Dealing with Vehicle Vibration, Tilt and Sway:

There would seem to be three ways to correct for a bumpy ride.

• Use gyrostabilized video cameras.

• Use electronically-stabilized video cameras.

• Try to maintain registration of the imagery produced by the cameras. With our incremental approach to motion and our predictions of how the next scene should look, we should be able to do that on the move.

Dealing with Camera Pointing Inaccuracies:

In addition to the above three methods for correcting for camera pointing inaccuracies, we might also use two or three 12-bit optical encoders to provide angular positioning information accurate to one arc-minute.
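
Whether a given encoder meets the one-arc-minute target depends on the angular span it has to cover. The sketch below assumes an absolute encoder whose counts are spread evenly over that span; how the encoders would actually be coupled to the camera mount is not stated in the text, and the function names are invented.

```python
import math

def encoder_step_arcmin(bits, span_deg=360.0):
    """Smallest angular step, in arc-minutes, for an encoder covering span_deg."""
    return span_deg * 60.0 / (2 ** bits)

def bits_needed(target_arcmin, span_deg=360.0):
    """Minimum encoder bits for a step no larger than target_arcmin."""
    return math.ceil(math.log2(span_deg * 60.0 / target_arcmin))

print(f"12-bit over 360 degrees: {encoder_step_arcmin(12):.2f} arc-min per count")
print(f"12-bit over  60 degrees: {encoder_step_arcmin(12, 60.0):.2f} arc-min per count")
print(f"bits for 1 arc-min over a full turn: {bits_needed(1.0)}")
```

On these assumptions a 12-bit encoder reaches one arc-minute only if its range is restricted (or geared) to about 68° or less; covering a full turn directly at that accuracy would take about 15 bits.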

High Data Rates:

• Two 410,000-pixel cameras would spew out data at a rate of 73.8 megabytes per second (which corresponds to three bytes per pixel at 30 frames per second). That information will probably have to be pre-processed at the full rate before it is compressed. If we had to go to oversampling with 2,000,000-pixel cameras, we would be looking at 360,000,000 bytes a second. This may have to be processed onboard the robot before it can be compressed and sent to a PC or workstation (probably after a majority vote and edge detection have occurred). (Alternatively, a fiber-optic or coaxial cable might provide the bandwidth necessary to bring this kind of data deluge back to a bank of fast signal processors but it would mean that the robot would have to be tethered.) The processing speeds for handling this fast a data rate seem prodigious by today's standards (and probably ho-hum ten or fifteen years from now). One way to accommodate this would be to cut the frame update rate way, way down to the point where a desktop PC could handle the processing workload.
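
The data rates quoted above follow from a one-line product. The defaults of three bytes per pixel (one each for red, green, and blue) and 30 frames per second are assumptions inferred from the quoted totals rather than stated camera specifications.

```python
def video_data_rate(pixels_per_camera, cameras=2, bytes_per_pixel=3,
                    frames_per_second=30):
    """Raw, uncompressed video data rate in bytes per second."""
    return pixels_per_camera * cameras * bytes_per_pixel * frames_per_second

print(f"{video_data_rate(410_000):,} bytes/s for two 410,000-pixel cameras")
print(f"{video_data_rate(2_000_000):,} bytes/s for two 2,000,000-pixel cameras")
print(f"{video_data_rate(410_000, frames_per_second=5):,} bytes/s at 5 frames a second")
```

Dropping the update rate from 30 to 5 frames a second cuts the raw rate by a factor of six before any reduction in resolution is applied.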

Another approach might be to process only the approximately 100,000 reduced-resolution pixels at a 5 frame per second update rate. Actually, the central-vision camera that covers a ±6° angular range will have to measure z-axis coordinates by the rate at which objects appear to enlarge as we approach them rather than by triangulation, so we might be able to live with noisy video from this central-field camera. If we have to oversample, we might cut the resolution to 100" of arc in the central field and 10' of arc in the peripheral field. This would allow us to continue to use the 410,000-pixel cameras described above. Another approach might be to look for sudden anomalies in the inputs from any of the three RGB channels, ignoring them if there is a sharp change in one and not in the other two.
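
The single-channel anomaly test mentioned at the end of the paragraph above could be as simple as the following sketch. The threshold value is a hypothetical placeholder that would have to be tuned to the actual camera's noise level, and the function name is invented.

```python
def is_single_channel_glitch(previous_rgb, current_rgb, threshold=40):
    """Return True if exactly one of R, G, B jumped sharply between frames.

    A real change in the scene usually moves more than one channel, so a
    jump confined to a single channel is treated as probable sensor noise.
    """
    jumps = [abs(c - p) > threshold for p, c in zip(previous_rgb, current_rgb)]
    return sum(jumps) == 1

print(is_single_channel_glitch((100, 100, 100), (200, 102, 99)))
print(is_single_channel_glitch((100, 100, 100), (200, 190, 195)))
```

The first call flags a red-only spike as a glitch; the second treats a jump in all three channels as a genuine change in the scene.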

The above is an off-the-top-of-the-head set of ideas for circumventing hardware inadequacies. There may be other candidate remedies that would help alleviate these problems if they are significant.

(6) Taking An Object-Oriented Approach to Vision

Our visual world is constructed of objects. Even grass and the sky are objects. A more advanced strategy would consist of varying the resolution in ways that are optimized to detect edges of objects, for example, using higher or even maximum available resolution to map out lines and other critical features. Resolution right at the periphery where moving objects are apt to enter the field of view might be elevated to the full 6' of arc that is available to the camera. For the cameras described above, this would add about 2,160 pixels to the total processing workload. Any objects that are moving would probably warrant high-resolution attention.

(7) Determination of Z-Axis Coordinates

May use lateral movement of the camera (triangulation) to estimate distances to objects. A machine vision system can make precise calculations of angles and of the distance moved, and can precisely calculate the distance to a stationary target (or even to a moving target once its trajectory is known). Camera motion can also afford very important clues to the differentiation of objects. Object pixels will shift together as a group, with different objects (at different distances) tending to move at different rates.
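
For a stationary target, the triangulation described above reduces to a single formula. The sketch assumes the camera slides sideways by a known baseline without rotating, and that bearings are measured from the forward axis, positive toward the direction of travel; the function name is invented.

```python
import math

def range_from_lateral_motion(baseline, bearing_before_deg, bearing_after_deg):
    """Forward distance to a stationary target seen from two camera positions.

    With the target at lateral offset X and forward distance Z:
        tan(before) = X / Z   and   tan(after) = (X - baseline) / Z
    so Z = baseline / (tan(before) - tan(after)).
    """
    spread = (math.tan(math.radians(bearing_before_deg))
              - math.tan(math.radians(bearing_after_deg)))
    if abs(spread) < 1e-12:
        raise ValueError("no measurable parallax for this baseline")
    return baseline / spread

# Usage: target 10 units ahead and 2 units to the side, camera moves 1 unit
before = math.degrees(math.atan2(2.0, 10.0))
after = math.degrees(math.atan2(1.0, 10.0))
distance = range_from_lateral_motion(1.0, before, after)
print(f"estimated distance: {distance:.3f}")
```

Because nearer objects produce a larger bearing change for the same baseline, the same quantity also separates objects at different distances, as the paragraph notes.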

(8) Use of Range finders

May use an ultrasonic or a laser range finder to measure distances.

(9) Overlapping Fields of Vision for Wider Fields of View

Although each eye can see throughout a full 180° range, the nose, the brow, and the body restrict vision to less than that full span. If stereo peripheral cameras are used for stereo vision, they might be arranged so that their fields of view overlap in the center but pick up extended side views. For example, two 75° stereo cameras might cover a 105° field of view with a 45° overlapping field of view in the center.

(10) Differential GPS

Two Competing Approaches (Analysis of Individual Details vs. Aggregation and Averaging to get the Big Picture):


Putting It All Together

For the robot entering a familiar room:

(1) Its image processor will know where it is moment by moment as it enters the room.

(2) Its image processor will consult its 3-D model of the room that the robot is about to enter.

(3) Its image processor will also carry along an image of the last scene the camera has recorded, in addition to the 3-D model.

(4) Its image processor will predict its new location by dead reckoning from its known location in the last frame.

(5) Its image processor will calculate the projections of all the object edges from its 3-D model, as modified by unexpected changes (if any) gleaned from the immediately preceding frames.

(6) The image processor will hunt for the edges and surfaces of the known objects by sampling along the edges, and randomly sampling the "interior" surfaces.

(7) The vision system will calculate horizontal pixel differences, looking for vertical edges and matching these up with the predicted object edges from its stored 3-D model. Then it will map out non-vertical edges and areas and match them up with the stored 3-D model. These comparisons will occur at the reduced resolutions of peripheral vision. The rationale behind comparing what it sees with its stored model rather than with preceding frames is that if changes are occurring slowly, it might not pick them up from recent frames, whereas if it compares what it sees with its stored recollections, it will detect any changes that have occurred since its last model update. (Prior models will be stored at some level of detail to provide a "corporate history" of how things were in the past.) Of course, the level of detail of the stored model may be so low that augmentation of the stored model with an examination of immediately-preceding frames might be in order.

(8) It will use a few landmark edges to correct its estimated position.

(9) It may use the edges abstracted from the scene to update the projections of its internal 3-D models, rather than trying to calculate and render the 2-D projections of its stored 3-D models for each 1/30th second or 1/5th second scene. Thus, we can use iterative mechanisms to reduce the computational burden.

(10) As the robot moves into the room, its vision system will use the baseline generated by its motion to triangulate on all the objects and features in the room, thereby generating or updating its 3-D model of the room.

(11) If the vision system detects a deviation from its stored model, it will pause and examine the deviation.

(12) The vision system must keep checking what it sees for the purposes of change detection. It will be object-oriented, and will check for new objects and changes in existing objects as opposed to comparing individual pixels.

(13) Considering the rate at which improvements are taking place in image processing hardware, it may be unnecessary to be concerned about reducing the computational workload. This problem may soon be overtaken by events.

May use coarse polygonal models for "ray-tracing". Might store critical angles at which hidden parts of an object begin to appear. Hey! Don't need ray-tracing! Will project the perceived image on our 3-D model. Might need to alter the perspective view of the 3-D model.

Might not even bother to update the 3-D model everywhere. Might simply observe the scene as it appears.

A first-generation robot could move and process at glacial speeds if necessary, realizing that higher speeds will become available at a later date.

• Separation of Luminance and Chrominance:

Separation of a video input signal into luminance (intensity) and chrominance (color) is probably desirable because we are trying to identify objects and objects are identifiable by color rather than by illumination (because of tricks of light or shadow). To accomplish this, it is necessary to convert to spherical polar coordinates. The magnitude of the resulting color vector will represent the intensity (luminance) of a pixel, while the inclination and azimuth angles will define a point on a color sphere (or color wheel if collapsed to two dimensions). If the three color components are positive 8-bit integers (0-to-255), then the angular coordinates of the color space will be confined to the first octant. If both positive and negative integers are permitted (-128-to-127), then all azimuthal values from 0° to 360° and all inclination angles from 0° to 180° will be present.

Computational Burden: Requires calculating the square root of the sum of the squares and two arc tangents for 450,000 pixels a second or 1.35 megabytes a second. Alternative: Since these are discrete numbers, use lookup tables.
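
The conversion and the lookup-table alternative might be sketched as follows. Several choices here are assumptions made for illustration: red, green, and blue are mapped to the x, y, and z axes, the inclination is measured from the blue axis, and the table is quantized to 5 bits per channel (32,768 entries) to keep it small, where a full 8-bit table would need about 16.8 million entries.

```python
import math

def rgb_to_spherical(r, g, b):
    """Return (luminance, inclination_deg, azimuth_deg) for one RGB pixel.

    Luminance is the vector magnitude; the two angles carry the color.
    Uses one square root and two arc tangents, as the text describes.
    """
    magnitude = math.sqrt(r * r + g * g + b * b)
    inclination = math.degrees(math.atan2(math.hypot(r, g), b))  # from blue axis
    azimuth = math.degrees(math.atan2(g, r))                     # in red-green plane
    return (magnitude, inclination, azimuth)

BITS = 5                      # assumed quantization per channel
SHIFT = 8 - BITS
LEVELS = 1 << BITS
HALF_BIN = (1 << SHIFT) // 2  # use the centre of each quantization bin

def build_table():
    """Precompute the conversion for every quantized (r, g, b) triple."""
    table = []
    for r in range(LEVELS):
        for g in range(LEVELS):
            for b in range(LEVELS):
                centre = [(v << SHIFT) + HALF_BIN for v in (r, g, b)]
                table.append(rgb_to_spherical(*centre))
    return table

TABLE = build_table()

def lookup(r, g, b):
    """Approximate rgb_to_spherical for 8-bit inputs by table lookup."""
    index = ((r >> SHIFT) * LEVELS + (g >> SHIFT)) * LEVELS + (b >> SHIFT)
    return TABLE[index]

dim = rgb_to_spherical(100, 100, 100)
bright = rgb_to_spherical(200, 200, 200)
print(dim, bright)
```

The two grey pixels differ only in the first (luminance) component, which is the property that lets a shadowed and a sunlit patch of the same surface be grouped together by color.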

• Will calculate horizontal differences first, noting differences that are greater than the noise threshold and the lengths of those vertical lines. Will store the maximum and minimum chrominance and luminance values and the maximum and minimum differences. Will sort the vertical lines
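
A minimal sketch of the horizontal-difference pass is given below. It works on a single channel stored as a list of rows and reports each vertical run of above-threshold differences; sorting the runs by length is an assumption of this example, since the note does not say how the lines are to be sorted, and the function name is invented.

```python
def vertical_edge_runs(image, noise_threshold):
    """Find vertical edge segments in a single-channel image (list of rows).

    Returns (column, top_row, length) for each run of consecutive rows in
    which the difference between column and column + 1 exceeds the
    noise threshold.
    """
    rows, cols = len(image), len(image[0])
    runs = []
    for c in range(cols - 1):
        start = None
        for r in range(rows + 1):            # extra pass closes a run at the bottom
            strong = (r < rows and
                      abs(image[r][c + 1] - image[r][c]) > noise_threshold)
            if strong and start is None:
                start = r
            elif not strong and start is not None:
                runs.append((c, start, r - start))
                start = None
    return runs

# Usage: a bright block whose left edge spans three rows
image = [
    [10, 10, 90, 90, 90],
    [10, 10, 90, 90, 90],
    [10, 10, 90, 90, 90],
    [10, 10, 10, 10, 10],
]
edges = sorted(vertical_edge_runs(image, noise_threshold=20),
               key=lambda run: run[2], reverse=True)   # longest lines first
print(edges)
```

The same routine applied to the transposed image would pick out horizontal edges.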

• Frame to frame, when operating in a familiar environment:

We can ping-pong back and forth between the predicted model and the perceived image, using the perceived image to update the predicted model. For the hidden line problem, we can use the real-time image, matching it to the 3-D silhouette model from the preceding frame. (Object silhouettes are what one would use to blot out what lies behind them.)

• 1st Look: All images will be standardized: rotated, scaled, and illuminated to a standard format.

• Always carrying expectation of predicted image.

• Will record details of high-res field wherever the gaze is directed. Gaze will be attracted toward the unexpected, the moving object. Because things run on automatic, inhibition will be the conscious default rather than choosing by volition. (However, there must also be provision to ignore bounded motion of bushes in the wind, and so forth.)

• Conscious memorization will be possible by reviewing the scene or event just witnessed. Traumatic events will also be recorded. Full-res video and audio recording will occur and will be retained for at least a few seconds after witnessing. Storage of about 1 to 1.5 MB per second will be required. Could record directly to disk or tape (passing through a small RAM buffer).

• High-contrast edges will be detected and matched first. Will be matched against expected 3-D outline models for these surroundings. Will store in crude form, filling in from remembered colors using generic textures. Might store separate, skeletal 3-D shots of views from doorways—that is, what is seen when entering scenes. Exact sizes, color, textures needn't be remembered.

• To generate 3-D models, use:

- stereo vision

- time-frozen stereo views generated by moving around (head-motion)

- ultrasonic range finder

- optical range finder?


• Will count on color differences to facilitate object recognition. Will use continuity of color on one side of an edge to help establish boundaries.

• Surfaces of objects would be better candidates for mapping than edges because edges (as well as surfaces) can be interrupted by bushes, etc. Also, it's hard to tell which side of an edge represents a continuous object without mapping out the nearly-the-same-color surface.

• If the color is approximately constant on one side of a boundary, it will suggest an object. Could map out objects having a given approximate color.

• Once objects have been differentiated, their template may be carried from frame to frame. They can be sharpened up from frame to frame: intra-pixel sharpening should be possible by changing the camera angle minutely.

• Special provision can be made for detecting straight lines. Circles and ellipses can be test-fitted to objects. Curves may be fitted using cubic splines. Extended regions may be standardized by identifying their long axes.

• An ultrasonic range finder could be used to determine distances and to map out 3-D objects.

• Associations will all have weights attached to them. Size could be a factor. It may be desirable to store image outlines at several perspectives to facilitate recognition. Attributes and structural decompositions may be necessary or helpful, since exact matches aren't going to be feasible. Shrinking and stretching of the stored image along its long and short axes may be necessary to achieve a match. Matching should be based on such things as two long horizontal lines and two short vertical lines, or a horizontal cylinder.

• Will want to map out areas of similar texture. "Similar" means using variable resolution. Will be alert for repeating patterns. Will be seeking symmetries and other regularities that must then be tested by the vision system. May want to use coarse-grained dots and surround-inhibition like the eye. May want to use short straight-line segments, with special responsiveness at line ends. Will use all available higher-level knowledge about the image. Will use functional knowledge. Will be trying to map out objects. All of this will be used to detect edges. Edge detection will be very much an interactive process. We won't try to detect edges and then features and then recognize objects. The process will be iterative. Location, the abstraction of functional requirements, the sense of what's proper will all enter in.

The Curious Case of the Cat

Consider thou now the curious case of the cat.

If you were to see the images shown below, you would probably say, without a moment's hesitation, "That looks like a cat."


The implications of this are startling. There are no clues here other than a portion of a cat's approximate silhouette. We could probably still make the identification if portions of the lines were missing and if the cat's head were turned in a different direction. The usual clues of size, color, and texture are missing, along with all other associations, and yet, we're still able to identify this as representing a cat. We're so smart!



What are the significant features of a cat?

• Whiskers.

• Nose

• Tail

• Ears

• Slit eyes

• Shape of head

• Jerky use of paws.

• Brindled fur

• "Meow!"

• Purring

• Licking fur

• Slinking

• Walking

• For example, a cat's ears form two triangles, but they are two triangles mounted in a certain way.

• We can recognize the cat's triangular ears given only interrupted segments of the triangles.

• We can recognize a cat's silhouette even if it's tilted laterally and toward or away from us.

• Strong correlations from common connections. We use a wide range of cues to recognize what's around us, and most of it we don't even try to recognize. (If this is Tuesday, we must be in Belgium.) Location, appropriateness, relevance. Color can be an important attribute (the Olive Garden sign, a traffic light). Recognition consists of a set of clues, any one of which may be sufficient to pin down the identification process but which may be used to steer the computer to the obvious conclusion. A green light located over a road is probably a traffic light. Traffic lights correlate entirely with roads, and strongly with urban areas. Cross streets are associated with them. Clues: green/yellow/red light over a road. Correlated with roads, railroad block signals, intersections (though not all intersections have traffic lights, and not all traffic lights are at intersections). Point: when we see a red/yellow/green light over a road, we don't have to analyze its shape. If it's round and if it changes with time, we assume that it's a traffic light. Shape is only one of very many correlations used in identification.

• A keyboard suggests a computer, a word processor, or some terminal input device. A keyboard is characterized by a lot of vertically-oriented little buttons evenly spaced and laid out in a rectangular format.

Mechanism: When a line segment lies along the projection of another line segment to within 1 to 2 pixels, it is an odds-on bet that the two line segments are part of one interrupted line. This inference is buttressed even further if the line segments point toward an actual or implied corner. Then by extension, we can connect curves with missing segments by using cubic splines, subject to the constraint that the curvature not change sign. Will want to calculate the slopes of line segments. May do as the eye does and connect patterns of dots and line segments.
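The first test in this mechanism, whether one segment lies along the projection of another to within a pixel or two, might be sketched as follows. The names `point_line_distance` and `likely_same_line` are hypothetical, the 2-pixel default is taken from the text, and the sketch assumes the first segment has nonzero length.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)  # assumed nonzero
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def likely_same_line(seg_a, seg_b, tol_pixels=2.0):
    """True if both endpoints of seg_b lie within tol_pixels of seg_a's projection."""
    a0, a1 = seg_a
    return all(point_line_distance(q, a0, a1) <= tol_pixels for q in seg_b)
```

For example, `likely_same_line(((0, 0), (10, 0)), ((15, 1), (25, 1.5)))` treats the second segment as a continuation of the first, while a segment ending at (25, 8) would be rejected.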

• Line Segments Would Intersect. When occluded line segments would intersect in 3-D, this is much stronger circumstantial evidence of a corner and of an occluded object than a 2-D corner.

• The role of the floor/ground and of the horizon. (There will normally be a floor/ground and a horizon.)

• Global (top-down) vs. local (bottom-up) analysis of imagery. Variable resolution. Repetitive patterns.

• The mounting of the ears and the shape of the head are crucial. And it doesn't matter one iota whether the cat's head is tipped to the side or is straight across.

• The all-important role of the setting of the scene. The equally-important role of expectations. How well would a new-born infant perform in a laboratory setting? If we're inside, there will be a floor, a ceiling, walls, doors, and maybe windows. If we're outside, there will be the sky, the ground or water, and then buildings and/or foliage.

• Continuity of color, texture, or pattern on one side of a line points toward a meaningful edge. Generally, if there is no change in color but a change in texture, a meaningful edge is also indicated. Abrupt changes in z-slopes are an important clue.

• Z-axis information is crucial to recognition. Z-sort, Z-ordering. Start with the closest tile (object).

• Variable resolution and averaging can take place by averaging four-pixel areas and then aggregating them into larger and larger clumps. Might identify general areas at a very crude level first, in terms of color, brightness, etc., followed by a finer and finer examination of detail.

• Don't need to recognize everything in the field of view.

• Need to look for repetitive patterns to establish the extent of objects.

• The role of symmetry and of repetitive patterns.

• We truly do store objects as incomplete shell models.

• In the recognition process, it would be better if the stored models of what we are trying to recognize were rotated into congruence with the observed, unidentified objects. Unfortunately, it would require a lot more processing power to rotate each stored model into alignment with the observed image than it would to rotate the observed image into a standard cross-sectional projection format in which the planes of the observed image are aligned perpendicular to or parallel to the direction of view for those objects for which this is possible.

An alternative approach would be to store many different views of each remembered object. However, the vision system will have to attempt to rotate observed image shells into standard orthogonal projections in order to compare them with stored models.

Idea: We may need to store front, side, top, and sometimes, back elevations of objects for shape matching. May want to store

• We may want to store and compare 3-D shell models of what we're seeing, rather than comparing 2-D silhouettes. (However, we are also able to recognize objects from 2-D line drawings.)

• Pre-classification will be the big job. Would like classification algorithms—e.g., could store points, slopes, and corners at 10° angles. Could store a set of basic objects, like parallelepipeds, pyramids, spheres, ellipsoids, spirals, cylinders, tubes, etc.

Alternatively, could store infinitesimal line segments, with slopes. Could then use cubic splines to represent shapes. Would need five values per point. Could store vertices or could store line segments at the vertices. What about a spiral? What about infinitesimal surface elements? Patch surfaces? May want to use the above generic shapes in addition to cubic splines.

Answer: We're going to need both 3-D and 2-D storage. A coil spring requires a linear (as opposed to a surficial) representation. Pictures would lend themselves to 2-D representations, although the objects they represent are 3-D.

• In comparing shapes, we need to retain qualitative characteristics. For example, in storing a guitar shape, we need to ensure that the "bottom" of the guitar is wider than the top (a very important idea). Otherwise, we need to set up a different category.

• Things come in functional sets. Sinks have faucets and faucet handles, drains and drain stoppers, and one or two feed pipes and a drain trap underneath. (However, we could recognize a sink even if it didn't have any plumbing associated with it. We could even recognize a cup-shaped indentation as a potential sink, based upon its functional possibilities.) When we see a table, we expect chairs.

Could have pointers to all triangle shapes. Will use the idea of weak multiple constraints to identify. For example, size is a weak constraint. (We could identify a cardboard box even if it were larger than anything we have ever seen before, so size is not a guarantee of anything, merely a clue.)

• Location and function could be very quick guides to identification, guiding the recognition system into the right neighborhood.

• We're going to need silhouettes viewed from at least the sides and the front. We may also need them from the top, the back, and, in some cases, the bottom. Given the 3-D shell model, we can generate these views as needed.

• We may be able to determine the angle of view directly from the image. This would allow us to calculate perspective views and the corresponding silhouettes from the stored 3-D images.

• Most visual recognition comes about not through disembodied object recognition but through other clues. Object recognition only confirms what other clues have indicated.
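The variable-resolution idea raised earlier in this list, averaging four-pixel areas and then aggregating them into larger and larger clumps, can be sketched as a simple image pyramid. This is a minimal sketch under our own assumptions: the image is a grayscale list of equal-length rows, and odd trailing rows or columns are simply dropped. The names `halve_resolution` and `build_pyramid` are hypothetical.

```python
def halve_resolution(image):
    """Average each 2x2 block of a grayscale image (list of equal-length rows).

    Odd trailing rows or columns are dropped for simplicity.
    """
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def build_pyramid(image):
    """Return [full-res, half-res, quarter-res, ...] until one dimension reaches 1."""
    levels = [image]
    while len(levels[-1]) >= 2 and len(levels[-1][0]) >= 2:
        levels.append(halve_resolution(levels[-1]))
    return levels
```

A recognition pass could then start at the coarsest level to find general areas of similar brightness, and descend to finer levels only where detail is needed.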




The following material (which will later be relegated to an Appendix) is being abstracted from the book "Wet Mind", written by Stephen M. Kosslyn and Olivier Koenig (Harvard University). This excellent book discusses attempts to infer brain processing flows and centers from a functional analysis of what the brain must do in order to process information as it does.

The field of brain research has been plagued with acrimonious debates from its inception. In the first half of the 19th century, the debate raged first over phrenology: Were such functions as acquisitiveness, sublimity, secretiveness, veneration, firmness, hope, and parental love localized in the brain, and could one measure their intensity by measuring enlarged areas on the skull? In the latter 1800s, animal experiments and observations of brain-damaged humans began to point toward a global model of mental faculties. At the same time, growing evidence indicated that there was localization of brain function but not at the high levels postulated by the phrenologists. Again, debate raged between the localists, led by Broca and Wernicke, and the globalists, rallied by Jackson and Karl Lashley. We know today that both camps were partially correct. There is a great deal of localization of function within the brain, but other areas can substitute with degraded performance ("vicarious function") when necessary.

In the 1970s, an AI researcher by the name of David Marr began to try to "reverse engineer" the brain. Dr. Marr observed that edges constitute zeroes in the second derivatives of intensity functions (Figure 1). Sets of such zero-crossings can often be connected with small line segments, blobs, and so on. The brain differentiates between edges and texture markings in two ways: (1) by using on-center/off-surround neurons and off-center/on-surround neurons, and (2) by using variable resolution neurons.

Edges, unlike texture and surface markings, show up at many levels of resolution.


• Find locations with sharp changes in intensity (where the 2nd derivatives of the pixel density values change sign),

• Connect up these sharp-change dots and micro-areas to form line segments and blobs, and

• Select changes that are present at different levels of blur.

Neurons with the right characteristics perform the equivalent of taking the 2nd derivative of the input.
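The first and third steps above can be illustrated on a single scan line of intensities. This is only a toy sketch under our own assumptions: a moving-average blur stands in for whatever smoothing the visual system performs, sign changes are located between the nearest samples of opposite sign, and a crossing counts as an edge only if it stays put (within `slack` samples) at every blur level. All function names and parameter values are hypothetical.

```python
def blur(signal, radius):
    """Moving-average blur; the window is clipped at the ends of the signal."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def zero_crossings(signal, eps=1e-9):
    """Positions where the discrete 2nd derivative changes sign.

    Values within eps of zero are skipped, and each crossing is placed midway
    between the two opposite-signed samples that bracket it.
    """
    # d2[k] is the second difference centred on sample k + 1.
    d2 = [signal[i - 1] - 2 * signal[i] + signal[i + 1]
          for i in range(1, len(signal) - 1)]
    crossings = []
    last_sign, last_idx = 0, None
    for k, v in enumerate(d2):
        sign = (v > eps) - (v < -eps)
        if sign == 0:
            continue
        if last_sign and sign != last_sign:
            crossings.append((last_idx + k) / 2 + 1)
        last_sign, last_idx = sign, k
    return crossings

def stable_edges(signal, radii=(1, 2, 3), slack=0.5):
    """Keep crossings from the finest blur that reappear at every coarser blur."""
    per_level = [zero_crossings(blur(signal, r)) for r in radii]
    return [i for i in per_level[0]
            if all(any(abs(i - j) <= slack for j in level)
                   for level in per_level[1:])]
```

On a scan line containing a one-pixel bright spot and a genuine step in brightness, the crossings around the spot drift apart as the blur widens and are discarded, while the crossing at the step stays in place at every level.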

This is far from the whole story. Color, texture, relative depth, etc., enter into recognition.

Five Principles of Neuroscience

1. Division of Labor

The brain uses one subsystem to recognize the meaning of a word, independently of where it's encountered, and another subsystem to recognize accents or handwriting style. Similarly, independent subsystems are needed to loosely recognize objects while affording precision location information for motor functions to reach and navigate.

2. Weak Modularity (Global vs. Local)

Neurons in visual area MT work with subsystems that distinguish shape as well as subsystems that track moving objects. Neurons in various parts of the brain may support a given function. However, in the main, functions are localized. Adjacent neurons interconnect with each other to perform a given function.

3. Constraint Satisfaction

The brain tries to use multiple cues simultaneously, e.g., using both continuity of color and line orientation to locate edges. Input/output mappings are not precise ("coarse coding" or "fuzzy logic"), which lends itself to generalization. Different inputs can produce the same outputs. Coarse coding (fuzzy logic) uses the overlapping of weak constraints to produce a strong constraint.

4. Concurrent processing

Parallel competition.

Serial Cascades

5. Opportunism

We perform a task using whatever information is available, even if that information typically is not used in that context. The frontal lobes may have evolved to guide fine motor movement but may be pressed into performing arithmetic and other serial activities.

Some Anatomical Definitions

The sulcus is a cortical fold; the gyrus is a bulge.

The fissure of Rolando (central sulcus) divides the frontal lobe from the parietal lobe.

The fissure of Sylvius (lateral fissure) divides the temporal lobe from the frontal and parietal lobes.

Congress has designated the 90's as the Decade of the Brain.

In discussing the way in which early AI researchers underestimated the magnitude of the problem of machine vision, the authors recount that a student was once asked to build a machine vision system as a summer project.

Von Neumann computers distinguish between programs and data; neural nets do not. Some are recurrent (feedback) networks. Often, neural networks can give the right answer when the input data is incomplete (vector completion). Conventional computers use recipes which specify rules for transforming the input to the output; neural nets are wired in such a way that the transformations from inputs to outputs can be described using rules. Mapping inputs to outputs (functions).
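Vector completion can be demonstrated with a very small recurrent network. The sketch below uses a Hopfield-style network, which is our own choice of example and is not named in the source. Patterns are lists of +1/-1 values, unknown inputs are marked 0 (an assumption), and the names `train` and `recall` are hypothetical.

```python
def train(patterns):
    """Hebbian outer-product weights for patterns of +1/-1 values (zero diagonal)."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)]
            for i in range(n)]

def recall(weights, state, sweeps=5):
    """Repeatedly set each unit to the sign of its weighted input (feedback)."""
    state = list(state)
    for _ in range(sweeps):
        for i, row in enumerate(weights):
            total = sum(w * s for w, s in zip(row, state))
            if total != 0:
                state[i] = 1 if total > 0 else -1
    return state

# Two stored patterns; the cue is the first pattern with its last three values unknown.
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([pattern_a, pattern_b])
completed = recall(weights, [1, 1, 1, 1, -1, 0, 0, 0])
```

Note that the stored patterns live entirely in the weights; there is no separate program, which is the point made above about neural nets not distinguishing programs from data.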

The Information Flow Diagram

Information Lookup to:

Attention Shifting to:

Attention Window in:

Visual Buffer to Both:

Object Properties Encoding and

Spatial Properties Encoding, and

Both to Associative memory to:

Information Lookup.

The ventral (occipital to lower temporal) system deals with recognizing object properties such as shape, color and texture (inaccessible to other sensory modalities), and calls for very coarse precision in order to facilitate pattern matching.

The dorsal (occipital lobe up to parietal) system mediates spatial properties such as location and size. This must be done precisely to support hand-eye coordination and eye motion to peripheral stimuli. (These have been dubbed the "what" and "where" systems.) Information emanating from the various sensory inputs comes together in (distributed) associative memory (name, categories, sound it makes, etc.). As an example of a situation in which additional information must be solicited to make a positive identification, consider a dog on its back with its legs splayed. This would obviously be a problem for a recognition system that relied solely upon shape. For this example, the brain would need to use top-down processing with the frontal lobe generating hypotheses and drawing upon previously-stored information. The brain (presumably) cycles until it identifies the hypothesized object.

Shapes are matched in the ventral system, while spatial properties are extracted in the dorsal system. Note that the left halves of the retinas of both eyes (which view the right half of the visual field) transmit to the left hemisphere while the right half-retinas are wired to the right hemisphere. Somehow the two halves of the images are processed as a whole.

The brain uses simple neurons highly tuned for specific shapes in the lower temporal lobe.

Experiments were run at Harvard using back propagation neural nets to recognize shapes and locations. The results of this experiment indicated that identifying the shape is more complex than identifying the location.

A stimulus-based attention shifting subsystem is indicated. For neonates, a phenomenon called "perceptual capture" occurs in which subcortical structures force the infant's eyes to follow a peripheral stimulus involuntarily (motion capture). After the infant reaches the age of about two months, the activation of the visual cortex gives the infant the ability to disengage from peripheral stimuli.

The Object Identification Processing Chain versus the Location Pathway

The geniculo-striate pathway to the thalamus mediates object identification while the tecto-pulvinar pathway plays a key role in orienting attention toward the stimulus. But we can identify objects seen out of the corner of the eye. When the geniculo-striate pathway is damaged, the eye will still be drawn toward stimuli even though they can't be recognized.

The Visual Buffer

Input from the eye (primarily via the geniculo-striate pathway) arrives at the visual buffer on the surface of area V1. Images are retinotopically organized, though slightly distorted and magnified at the center. Most of these areas register only half the space. Thirty-two retinotopically organized areas have been identified in a macaque monkey's brain, covering over half the cortical surface. This image stores edges, and color and texture maps, together with relative distances from the viewer. However, objects are not identified at this stage.

The Attention Window

We can focus attention on different parts of an image without moving our eyes. The attention window serves as a gateway to the object-properties-encoding system (V4 in the lower temporal lobe). The attention window can be directed to pick out a different region in the visual buffer. This suggests a stimulus-based attention shifting subsystem. The attention window can expand or shrink, helping us to identify objects at different distances.


Computer vision specialist David Lowe identified some half dozen image properties that are invariant under rotations, scale changes, and translations. Non accidental properties (affine properties?) If distinguishing features are missing, recognition becomes very difficult or impossible.

Opportunistic Encoding

The brain will encode whatever is available to recognize something—e.g., idiosyncrasies.

Motion Relations

Motion helps us delineate objects. Also, distinctive motions help us recognize particular objects. Areas MT and MST contain neurons that are sensitive to the directions and rates of motions, and also to patterns of motion such as expansion and contraction.

The Visual Memory

Visual Pattern Memory

Separate memories are used to store auditory, visual, tactile, kinesthetic, and other inputs. The authors call the visual memory the pattern activation memory because stored patterns are activated by visual inputs.

"The patterns can correspond to objects or to parts of objects, and the patterns will be activated in several ways; as will be discussed in the next chapter, mental images can be formed by activating these representations 'top down'."

The inputs are the input perceptual units which signal distinctive characteristics of objects and the inputs from the motion relations subsystem.

"The matching is made through a process of constraint satisfaction: If an object has two parallel lines a certain distance apart, and of such-and such a length, and the lines come to a point at one end (a pencil), there are only so many objects it can be. The challenge is a little like that of doing a crossword puzzle: Once enough letters are in place, there are only a few possible words that can be formed, even if a letter or two are missing. Lowe showed that this idea works well if the objects are familiar, and so relatively complete sets of non accidental properties (and their relative position) have previously been stored."

Uses non accidental properties and their relative positions. "The lower temporal lobe is involved not only in encoding object properties, but also in storing visual information." Also, there are neurons in the upper temporal bulge (the superior temporal sulcus, or STS, which defines the upper boundary of area IT in the temporal lobe) that respond selectively to static views of the head. (Area IT is subdivided into frontal, or anterior, and back, or posterior, areas.) "Some of these areas have remarkably specific taste in stimuli, responding most strongly to the head in specific orientations or to the eyes in specific positions (different directions of gaze)." Other cells are tuned more broadly, responding strongly to the head or eyes when they are in a number of different orientations or positions. (Note that neurons will still fire under less than optimal conditions but not as readily.)

The output of this visual pattern buffer doesn't contain related information such as names, sounds, feel, etc. This must be derived when the output vector reaches an associative memory.

Identifying Partially Occluded Shapes

The principle of constraint satisfaction is the key to understanding this ability. This principle is relevant at two levels of processing. First, there is redundancy in the information encoded in the ventral system. For example, even if the edges delineating an eraser are missing, those that define its parallel edges and its point may be enough to identify the pencil. Distinctive motions alone are sometimes sufficient to recognize an object.

"Second, the information from the ventral system is combined with information from the dorsal system regarding location, size, and orientation of the stimulus. If the object were six feet long, we would not identify it as a typical pencil even it had the right sort of parallel edges and point at the end."

Associative Memory

If the visual feature match is unambiguous, then size, location, and orientation need not be considered. Otherwise, the brain must integrate spatial properties with object properties in the associative memory. Two other properties:

• The threshold may be adjusted depending upon the context. If one is looking for one's cat, the tip of a tail sticking out may be sufficient to identify it.

• The various properties of an object are not equally important. Three important attributes may identify it, or six less important attributes.
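These two properties, unequal attribute weights and a context-dependent threshold, can be sketched as a simple weighted score. The attribute names, the weights, and the thresholds below are invented purely for illustration, and `identify` is a hypothetical function name.

```python
def identify(observed, models, threshold):
    """Return (name, score) for stored models whose matched weight reaches threshold."""
    matches = []
    for name, weighted_attributes in models.items():
        score = sum(weight for attr, weight in weighted_attributes.items()
                    if attr in observed)
        if score >= threshold:
            matches.append((name, score))
    return sorted(matches, key=lambda m: -m[1])

# Hypothetical stored model: three important attributes (weight 2.0 each)
# or six less important ones (weight 1.0 each) both reach a score of 6.0.
models = {
    "cat": {
        "triangular ears": 2.0, "whiskers": 2.0, "tail": 2.0,
        "four legs": 1.0, "fur": 1.0, "slit eyes": 1.0,
        "small size": 1.0, "slinking gait": 1.0, "paws": 1.0,
    },
}

normal_threshold = 6.0      # ordinary viewing
searching_threshold = 2.0   # lowered when we are actively looking for the cat
```

With the normal threshold, a tail alone identifies nothing; with the lowered threshold used while searching for one's cat, the tail is enough.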

Identifying Unfamiliar Shapes

We can identify familiar objects (e.g., cats) even when their shapes do not exactly match the shapes of any of those objects (cats) we have ever seen. We recognize parts

(To be continued)


• Images are made up of things. Blue suggests sky, green suggests plant life. Look for trapezoids: suggests squares viewed in perspective. Look for vertical lines suggestive of buildings, trees, telephone poles. Look for near-symmetries. (Using range-finding info may permit assessment of size together with a determination of orientation. From this information, an object may be converted into a 3-D model, and then our vision system may correct for perspective effects to facilitate the recognition of the object.) Anticipated scenes and their associated aural cues (if any) will greatly facilitate recognition. Will normally be dealing with "stage settings". Image registration, magnification adjustment, and the determination of perspective effects when first entering an anticipated scene and in going from scene to scene should be relatively simple when dealing with stored "fuzzy models". Once the image has been registered, and the magnification and perspective established, the location of the camera within the scene should be a direct concomitant. (This would imply a geometrically-precise memory of the view from a doorway. However, in biological vision systems, our precise location isn't known to us; we have only an approximate idea of where we are, how large everything is, etc.) As the robot enters a scene, it can retrieve from stored information the 3-D model of the scene.

• Could store a generic library of common things—lamps, tables, chairs, toasters, etc. In fact, we'll have to store such a library. The only question is that of whether it will be generated through direct experience or whether it will already be in place. (It may be desirable at first to generate such a library through direct experience in order to place in it the kind of information which is generated by direct experience.) The kind of items expected in a scene would come out of prior knowledge. The 3-D model would be a model of expected objects within the scene. Only those objects which are unexpected would be given attention. (Objects which are expected but which have been moved from their standard locations would receive sufficient attention to recognize them.) Recognition would focus on details. For example, in a house, if there is molding around a base, experience suggests that it is built-in. This means that the inferential engine is quick to jump to conclusions, and equally quick to abandon them if experience invalidates them. Ignoring irrelevant details is a vital part of the recognition process.

• The mind must have the ability to recreate approximate scenes and to animate them. The kind of detail stored with a scene would be at a level that would remember a modest number of approximate colors (with real-time inputs standardized to the nearest approximate hue), approximate textures, approximate shapes, etc. Glint would be factored out.

• To calculate chrominance, determine the direction of the color vector; to calculate luminance, determine its magnitude.
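The rule above can be sketched by treating (R, G, B) as a vector and converting it to spherical polar coordinates. This is a minimal sketch under our own conventions: the polar angle is measured from the blue axis and the azimuth in the red-green plane, and "luminance" here simply means vector magnitude as in the text, not a perceptually weighted brightness. The function name `rgb_to_spherical` is hypothetical.

```python
import math

def rgb_to_spherical(r, g, b):
    """Return (luminance, theta, phi): magnitude and direction of the colour vector.

    theta is the angle from the blue axis; phi is the angle in the red-green plane.
    """
    magnitude = math.sqrt(r * r + g * g + b * b)
    if magnitude == 0.0:
        return 0.0, 0.0, 0.0  # black: direction undefined, angles set to 0 by convention
    theta = math.atan2(math.hypot(r, g), b)
    phi = math.atan2(g, r)
    return magnitude, theta, phi
```

A surface in shadow and the same surface in full light then differ mainly in magnitude while keeping nearly the same two angles, which is what makes the direction useful for finding object edges.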

Least-Demanding Scenario: Stationary Camera, All Objects Recognized.

We must detect changes as we go from frame to frame. We could compare each frame with the preceding frame pixel-by-pixel but noise might generate spurious differences. Even minute changes in camera position would require image morphing and might also cause irrelevant disparities at the edges of objects. In other words, pixel-by-pixel comparisons probably aren't feasible without some computationally-intensive translations, rotations, and scale changes. Even a pixel-by-pixel comparison would require as many subtractions and comparisons as there are pixels. A slight change in camera angle would require additions to or subtractions from the x or y coordinates of all the pixels. For 100,000 pixels @ 5 frames a second, we're looking at 500,000 pixel manipulations a second.
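A minimal sketch of the frame-to-frame comparison follows, with a noise threshold so that small brightness fluctuations are ignored. It assumes a perfectly stationary camera and grayscale frames stored as lists of rows; the threshold value and the function names are our own illustrative choices.

```python
def changed_pixels(previous, current, noise_threshold=8):
    """Return (row, col) positions whose brightness changed by more than noise_threshold."""
    return [
        (r, c)
        for r, (row_prev, row_cur) in enumerate(zip(previous, current))
        for c, (a, b) in enumerate(zip(row_prev, row_cur))
        if abs(a - b) > noise_threshold
    ]

def comparisons_per_second(pixels_per_frame, frames_per_second):
    """One subtraction-and-compare per pixel per frame."""
    return pixels_per_frame * frames_per_second

# The figure quoted above: 100,000 pixels at 5 frames a second.
load = comparisons_per_second(100_000, 5)
```

Here `load` works out to the 500,000 pixel manipulations a second cited in the text; any camera motion would add the morphing cost on top of this.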


• Will have a 3-D map of all the objects in the scene. Will predict the location of the left-most vertical feature in the scene. Can use that to estimate...

Think again. The categories of operation are:

1. Everything stationary—nothing changing. (Must monitor for changes.)

2. Stationary camera, something moves within the field of view.

3. Everything's stationary but something changes slowly—perhaps something critical.

4. The camera's moving, looking straight ahead.

5. The camera's moving, facing a different direction.

6. The camera's stationary but rotating.

7. The camera's translating and rotating.

8. Entering a new environment.

• When the camera rotates, it probably can't rotate faster than 60° in a fifth of a second. At 30 frames per second, a fifth of a second spans six frames, so this would imply a maximum rotation rate of 10° per frame. That means that only 10° of the scene would have to be compared against the stored 3-D environment every 1/30th of a second. Similarly, when entering a new scene, entry motion will be at about an inch a frame.

• We might handle rotation by drawing upon the 3-D model of known objects. Those objects that are in the 10° field will have to be translated, rotated, scaled, and projected to fit the camera's perspective viewing angle. Colors and textures can be provided for comparison with the camera's live image. Alternatively, the outlines, colors and textures of the objects in the camera's live image can be used to generate the rotated views of objects as long as rough identifications have been made with the objects in the corresponding 3-D model. (Calculating the apparent locations and apparent sizes of objects in the 3-D from the camera's perspective should be relatively easy. The computationally-demanding part would seem to be calculating their cross-sections and eliminating their hidden lines.)

• Rather than working from left to right, we might begin by examining the high-resolution center of the image. If the camera is translating directly forward, objects would be expanding but not translating or changing shape.




• Will begin differencing the peripheral view from the center outward. Will look for vertical and near-vertical lines because vertical lines are undistorted by perspective. Will store the lengths and magnitudes of the lines.

• Will have the chrominance and luminance ranges from the last scene for purposes of anticipation.

• Will attempt to register vertical lines from this scene with the vertical lines from the last scene.

• Could imagine a lens which would spread out the image at the center, while compressing it around the periphery.

• For an automower or an automop, resolutions, angular field of view, and frame rates could be lower than for an animal eye. This would lower the cost of the video camera as well as the processor speed requirements. Trees, bushes, driveways, sidewalks, buildings, cars, people, cats, dogs, flower beds, bicycles, tricycles, dolls, toys, and fallen tree branches could be built-in objects, so that the automower could be programmed to quickly recognize them once placed in a new environment. The edges of the yard and of forbidden zones (e.g., flower beds) would have to be pointed out to the automower's powerhead, which would store them in memory as forbidden coordinate boundaries. The powerhead could be outfitted with rubber-tired odometer or optical encoder wheels on each side so that it could navigate by dead reckoning. (Probably, though, the odometers would add unnecessary cost to the device, since dead-reckoning could probably be a part of the visual navigation.) The precise angles between landmarks as well as the angular locations of the landmarks themselves could be used to precisely pinpoint the location and orientation of the powerhead. The automower could have some ability to circumnavigate obstacles. It might also have LEDs between a comb to count uncut grass blades a la the Lawn Ranger. The teeth of the mower could be placed closely enough together... Objects which moved during the course of a mowing or mopping would be considered transient and would not be stored as part of the visually-measured terrain map.
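The dead-reckoning option mentioned above, with an odometer wheel on each side of the powerhead, might be sketched as follows. This is a minimal sketch assuming a differential-drive layout, short update steps, and no wheel slip; `wheel_base` (the distance between the two wheels) and the function name `dead_reckon` are hypothetical.

```python
import math

def dead_reckon(x, y, heading, left_dist, right_dist, wheel_base):
    """Advance a pose (x, y, heading in radians) given the distance each wheel rolled."""
    forward = (left_dist + right_dist) / 2.0
    turn = (right_dist - left_dist) / wheel_base
    # Move along the average heading over the step (a small-step approximation).
    mid_heading = heading + turn / 2.0
    return (x + forward * math.cos(mid_heading),
            y + forward * math.sin(mid_heading),
            heading + turn)
```

Errors accumulate with every step, which is why the text suggests using the angles to visual landmarks to correct the estimated position and orientation.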

Perhaps many of these receptors are dedicated to detecting straight lines and certain other types of features.

Simplifying the Problem:

(1) Our visual world is comprised entirely of objects. Even the sky and grass may be considered to be objects.

(2) At any given time, we know where we are. Furthermore, at almost all times, the objects within our field of view are known objects and have already been recognized.

Only when we are presented with a picture or enter an unfamiliar enclosure do we face the problem of recognizing an environment from scratch. Consequently, as we process video scenes frame by frame, there is very little that changes from one frame to the next that can't be predicted from the known velocity and camera rotation rates. It should then become a relatively simple matter of matching the predicted locations of landmark (high contrast, immobile) objects with their actual measured locations in each new frame.

(3) We store a crude 3-D model of places we have experienced, as well as the view of these models we have experienced from doorways. A footstool out of place or magazines on a table or unexpected occupants would be the common variations which would require rapid recognition.
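The landmark matching described under item (2) might be sketched in plan view as follows. This is only a sketch under our own assumptions: landmarks and the camera are points in a 2-D floor plan, the camera pose for the new frame has already been predicted from known motion, and a measured bearing is accepted if it lies within a fixed tolerance of the prediction. The function names are hypothetical, and wrap-around at ±180° is ignored in the matching step.

```python
import math

def predicted_bearing(landmark, camera_xy, camera_heading):
    """Bearing (radians, relative to the optical axis) at which a landmark should appear."""
    dx, dy = landmark[0] - camera_xy[0], landmark[1] - camera_xy[1]
    bearing = math.atan2(dy, dx) - camera_heading
    return math.atan2(math.sin(bearing), math.cos(bearing))  # wrap into [-pi, pi]

def match_landmarks(predicted, measured, tolerance):
    """Pair each predicted bearing with the nearest measured one if within tolerance."""
    pairs = {}
    for name, p in predicted.items():
        best = min(measured, key=lambda m: abs(m - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            pairs[name] = best
    return pairs
```

Landmarks that find no match within tolerance, or measured features left unpaired, are the candidates for the "footstool out of place" kind of variation that needs fresh recognition.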

Changes in color are usually more significant in distinguishing the edges of objects than are changes in illumination (which may be caused by shadows). Consequently, we may wish to base our recognition and edge detection more on changes in chrominance than on changes in luminance. Separation of chrominance and luminance may be effected by converting the three primary color components to spherical polar coordinates. upper half-space three red, green and blue the of the three color components 2,000,000 we could for the arc tangents

Looking for vertical or near-vertical lines. Will then seek lines which attach to the vertices of the vertical or near-vertical lines.



Pan and Tilt

The two small, lightweight cameras would be panned and tilted by high-speed servos controlled by software which would optimize the slewing of the cameras to any given angular direction with a minimum of settling time. This would approximate the ability of the human eye to direct its gaze to any desired location. Because the foveal region of maximum resolution would be larger for a robotic vision system than it is for the human eye, precise direction of gaze toward an object of interest wouldn't be as critical as it is for the animal eye.

Navigating through/Viewing a Known Environment:

The robot's vision subsystem will draw upon an approximate stored 3-D model of the environment in which it is functioning. Based upon known rates of camera motion, it will generate a perspective view of the predicted environment a 30th of a second later than the last frame. This can be done crudely, using the stored 3-D model with simple ray-tracing at the edges of "framing objects". By keeping a record of those areas in which ray-tracing occurred in the last frame and using linear estimation for a few frames, it may be possible to simplify the computation. Objects which begin to move into the field of view simplify the computation. Previously-hidden objects may be anticipated from prior experience or possibly from the layout itself. Things should be done as crudely as is acceptable to minimize the computational load. Then the current image itself may be projected as a texture map onto the perspective-adjusted 3-D model. It may be desirable to compare the current frame with the preceding frame pixel-by-pixel but only approximately, because exact pixel-by-pixel matches won't be feasible. Once the model has been matched with the actual scene, the position and orientation of the camera may have to be corrected as a result of the matching.
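As a rough illustration of predicting the next frame from known camera motion, the sketch below moves one fixed landmark through a 30th of a second and reprojects it. Everything specific here is an assumption made for the example: a pinhole camera, a focal length expressed in pixels, motion limited to forward travel plus rotation about the vertical axis, and the name `predict_pixel`. Applying the translation before the rotation is itself an approximation that is reasonable only for small per-frame motions.

```python
import math

FRAME_DT = 1.0 / 30.0  # seconds between frames, as in the text

def predict_pixel(point, forward_speed, yaw_rate, focal_px):
    """Predict where a fixed landmark appears in the next frame.

    point: (x, y, z) in the camera frame (x right, y down, z forward), in feet.
    forward_speed: camera speed along z, in feet per second.
    yaw_rate: rotation about the vertical axis, in radians per second
              (positive turns the camera to the right).
    focal_px: assumed pinhole focal length, in pixels.
    """
    x, y, z = point
    z -= forward_speed * FRAME_DT   # the camera advances toward the scene
    a = yaw_rate * FRAME_DT         # the scene appears to rotate the opposite way
    x_new = x * math.cos(a) - z * math.sin(a)
    z_new = x * math.sin(a) + z * math.cos(a)
    return focal_px * x_new / z_new, focal_px * y / z_new

# A landmark 10 feet ahead and 1 foot to the right, camera walking forward at 3 ft/s.
u, v = predict_pixel((1.0, 0.0, 10.0), forward_speed=3.0, yaw_rate=0.0, focal_px=500.0)
```

The predicted `(u, v)` would then be compared with the landmark's measured location in the new frame, and the difference used to correct the assumed camera position and orientation.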

If a significant discrepancy occurs between the expected and observed images, the actual image would be flagged for recognition and attention, so that the unexpected object entering the field of view can be examined and recognized and the model updated. Or it may be appropriate to match scenes object by object, using curve fits, texture maps, and color matches to make the comparison.

The vision system will extract the vertical lines in the image. These will be matched with the predicted lines in the predicted image. The new or transplanted object would be added to the data base and the 3-D model as part of a stored "action sequence". Colors, textures, and outlines will also be matched with their predicted values, as reinforced by the preceding scene.

Range Finders

A Polaroid ultrasonic range finder would probably be a useful adjunct to the vision system, permitting the construction of 3-D models without stereo vision.

If necessary, the vision system can cut the frame rates at which it operates, using linear or curvilinear predictions to extrapolate across the frames it skips. It might also process peripheral vision at lower resolution and central vision over a smaller area to cut the computational workload during routine operations. Recognition of all of the objects within a field of view may not be necessary. What is necessary is the detection of moving objects or of the grossly unexpected.
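One simple reading of "linear or curvilinear predictions" is extrapolation from the last two or three measured positions of a tracked feature. The sketch below is an assumption about how that might look, not a specified method: `extrapolate` is a hypothetical helper that fits a straight line through two samples or a parabola through three, and steps it forward by the number of skipped frames.

```python
def extrapolate(track, steps_ahead=1):
    """Predict a future coordinate from the most recent samples of a track.

    With three or more samples, a parabola through the last three is used
    (a constant-acceleration, "curvilinear" prediction). With two samples
    the prediction is linear. With one, the last value is simply held.
    """
    k = steps_ahead
    if len(track) >= 3:
        p0, p1, p2 = track[-3:]
        velocity = p2 - p1
        accel = (p2 - p1) - (p1 - p0)
        return p2 + k * velocity + 0.5 * k * (k + 1) * accel
    if len(track) == 2:
        p0, p1 = track
        return p1 + k * (p1 - p0)
    return track[-1]

# Column position of an edge in three processed frames, then two skipped frames.
next_one = extrapolate([0, 1, 4], steps_ahead=1)
next_two = extrapolate([0, 1, 4], steps_ahead=2)
```

The same function would be applied separately to each image coordinate of each tracked feature, and a fresh measurement taken as soon as the prediction error grows too large.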


Visual Storage Requirements:


(2) What we actually store are crude and approximate 3-D models, coupled with sound bites, feelings, and action sequences.

Vision is an active process. We determine what we will examine, and we seek out what we see.

Two ways in which animals use vision are:

(1) to identify objects in the surrounding environment, and

(2) to determine location in that environment, i.e., navigational fixes.

It is debatable just how much recognition needs to occur, or just how much recognition


How do we store thousands of faces (including public personalities)?

• We don't store all those faces very accurately. We catch a flavor, an essence.




The Data Base Management System:

Our filing system must create its own categories. It should be highly dynamic, reorganizing itself as it goes along.

Shapes should be analyzed and somehow categorized parametrically. Category boundaries would be established more or less statistically as experience builds up. If something falls squarely within a given category, together with related identifying criteria, then recognition should be relatively simple.

Relationships (pointers) should be weighted according to (a) number of repetitions and (b) depth of meaning. Weights would be assigned (a) deliberately, (b) inversely by the number of repetitions, and (c) by the level of significance or trauma. Weights would decline with time and disuse according to some formula. We probably want to store related objects together and to store them with their environments. If we see railroad tracks, then engines, rolling stock, depots, baggage wagons, handcars, crossing gates, block signals, and so forth would be a part of that environment. Train whistles, diesel horns, chuffing, the rumble and hum of a diesel, the sight of steam, the smell of coal and diesel fumes, train bells, etc., would suggest train locomotives. Brief sound bites would be available for comparison, just as images of locomotives would be present.
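A weighted pointer that fades with disuse could be sketched as below. The text leaves the formula open, so the specifics are all assumptions: a logarithmic growth with repetition, a multiplicative significance factor, an exponential decay with a 30-day half-life, and the names `Pointer`, `reinforce`, and `weight`.

```python
import math
from dataclasses import dataclass

@dataclass
class Pointer:
    target: str
    repetitions: int = 0
    significance: float = 1.0  # hypothetical scale: 1.0 is ordinary, larger is more meaningful
    last_used: float = 0.0     # time of last use, in days

    def reinforce(self, now):
        """Record one more use of this relationship."""
        self.repetitions += 1
        self.last_used = now

    def weight(self, now, half_life_days=30.0):
        """Weight rises with repetition and significance, and halves for every
        half_life_days that pass without use (an assumed decay formula)."""
        idle = now - self.last_used
        decay = 0.5 ** (idle / half_life_days)
        return self.significance * math.log1p(self.repetitions) * decay

link = Pointer("steam locomotive", significance=2.0)
link.reinforce(now=0.0)
fresh = link.weight(now=0.0)
stale = link.weight(now=30.0)
```

Here `stale` comes out at half of `fresh`. A different growth law for repetitions, or a slower decay for traumatic events, would only change the `weight` method.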

Unique features, like a locomotive headlight or the bolts around the front boiler cap, would be suggestive of a particular object (in this case, a steam locomotive).

One of the problems is that of recognizing where a complex object starts and stops. Moving-as-a-unit is one important clue. Continuous texture would be a clue. Contiguity would be another requirement.

• Would store by environment.

• Would store silhouette.

• Would store shapes of components (features).

• Shapes, though stored in their individual environments, will be entered into a master address list.

• Shapes will be classified according to similarities to minimize search time in the master address list. There will be a shape table.

• Shapes will be described loosely and parametrically, and will be loosely compared.

• Feature extraction is very important.

• Weighted pointers. Weights will fade over time.

• Color(s), texture(s) will be discriminants.

• Feelings

• Feel

• Sounds

• Action tracks (historical)

• Thermal

• Kinesthetic

In recognizing an object such as a violin, the key features would seem to be shape, stock, size, color, opening, strings, bow, thickness, and bridge. Thickness and size would determine whether we would classify it as a violin, a bass, or a viol. Since we would still recognize it if it were a small copper model without a stock or any other violin-type features, the silhouette is a critical recognition factor. Normally, though, it will be a 3-D object and the system will have to recognize it from a perspective view.

Perspective Views

Recognizing objects seen in perspective represents a special recognition problem. Presumably, the vision system will rotate the 3-D model of the object until its surfaces are parallel to, or perpendicular to, the camera's line of sight. Only then can it extract the correct silhouettes (cross-sectional views). Of course, amorphous objects won't generally lend themselves to x-y-z projections. Perhaps a better strategy would be to extract the largest silhouette that one can abstract from an amorphous object. One has to note, however, that a cube would show its largest silhouette when it's tilted, so this strategy wouldn't work for rectilinear objects.
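The remark about the cube can be checked numerically. For a unit cube aligned with the axes and viewed from far away (an orthographic projection, which is an assumption of this sketch), the silhouette area along a unit viewing direction n is |nx| + |ny| + |nz|. The function name `cube_silhouette_area` is illustrative.

```python
import math

def cube_silhouette_area(view_direction):
    """Silhouette area of an axis-aligned unit cube seen along view_direction.

    Assumes an orthographic view. Each pair of opposite unit faces contributes
    the absolute value of one component of the unit viewing direction.
    """
    nx, ny, nz = view_direction
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (abs(nx) + abs(ny) + abs(nz)) / length

face_on = cube_silhouette_area((0, 0, 1))    # looking straight at one face
corner_on = cube_silhouette_area((1, 1, 1))  # looking along a body diagonal
```

Face-on, the silhouette is the unit square. Corner-on, it is a hexagon whose area is the square root of 3, the largest the cube can show, which is why a "largest silhouette" rule would pick a tilted view for rectilinear objects.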


Clustering and Local Environments:

Objects like door hinges and door knobs are clustered together with doors. When you see a computer, you are on the lookout for printers, modems, diskette cases, auxiliary disk drives, etc. Some objects tend to go together with other objects. Also, objects tend to be appropriate to specific environments. You don't really expect to see a miniature sleigh and eight tiny reindeer on a rooftop or a stop sign in the middle of the woods. You would do a double-take if you saw it. You would still recognize it but it wouldn't be unreasonable to expect it to take a little longer than it would if the object were in an appropriate environment. This appropriateness criterion can greatly reduce recognition times, while also developing the appropriate associations among objects. We will search a hierarchy of local-cluster shape tables as we work our way to more general shape tables.

Parametric Classification:

New shapes might automatically be parametrically classified and stored in a shape table in the order of their classifications, changing shape gradually as one moves through the table. (They might also be stored in terms of their actual, geometrical shapes for refined comparisons.) Shapes may be described in terms of a connected sequence of standard curves, together with parameters that describe each curve. For example, an arc of a circle would be described by eight parameters and seven pairs of limits: its identification number, its radius (relative to the size of the object), the number of degrees in the arc, its three-space orientation (two angles), the three position coordinates of its end point, and the allowable ranges of these parameters, based upon experience, within which we still will recognize the shape. An elliptic curve segment would be described in terms of ten parameters and nine pairs of limits: its identification number, its relative size, its eccentricity, its starting angle, its terminating angle, its 3-space orientation, and its three position coordinates. A shape could be defined as an ordered sequence of connected curves. Of course, for the elliptic segment this involves storing and comparing 28 numbers. The alternative is to store 2 curves defining the boundaries of allowed shapes, and then to check the test curve to see whether it lies within the volume generated by the two bounding curves. However, this entails calculating the maximum and minimum values of the coordinates.
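A stored curve with its allowable parameter ranges might look like the sketch below. The class name `CurveTemplate`, the parameter names, and the numerical limits are all invented for illustration. The example follows the circular-arc case: one identification number plus seven parameters, each with a pair of limits.

```python
from dataclasses import dataclass

@dataclass
class CurveTemplate:
    curve_id: int  # identification number of the standard curve type
    limits: dict   # parameter name -> (low, high) allowable range

    def matches(self, curve_id, params):
        """True if the curve type agrees and every parameter is within its limits."""
        if curve_id != self.curve_id:
            return False
        return all(low <= params[name] <= high
                   for name, (low, high) in self.limits.items())

# Hypothetical circular arc: radius and position are relative to the object's size.
knob_outline = CurveTemplate(curve_id=1, limits={
    "radius": (0.02, 0.06), "arc_degrees": (300.0, 360.0),
    "azimuth": (-10.0, 10.0), "elevation": (-10.0, 10.0),
    "x": (0.8, 1.0), "y": (0.4, 0.6), "z": (-0.1, 0.1),
})

observed = {"radius": 0.04, "arc_degrees": 350.0, "azimuth": 2.0,
            "elevation": -1.0, "x": 0.9, "y": 0.5, "z": 0.0}
is_match = knob_outline.matches(1, observed)
```

A shape would then be an ordered list of such templates, and recognition reduces to range comparisons on plain numbers rather than curve fitting.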

When it comes time to recognize a shape, some basic classification calculations will be performed to pre-classify the object. A lot of effort should go into this pre-classification (feature extraction, etc.). Then the search will begin in that section of the shape table that would contain shapes of the proper type. The comparisons would now take the form of comparisons of curve identification numbers and their parameters, measured against the identification numbers and allowable ranges of parameters for the shapes that are already in the generic shape table. Computers can compare simple numbers faster than they can make least-squares curve fits to actual curves. One might use Gaussian tails at the limits of the parametric ranges to allow some stretching of the bounds if other important recognition criteria were satisfied. For example, suppose that the robot found an ornate, square, glass doorknob like nothing it had ever seen before in the place where one would normally find a doorknob. The square doorknob should still be recognized as a doorknob even though it wasn't round. Of course, this would add ornate square glass doorknobs to the list of anticipated doorknob shapes in the local shape table.
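The idea of Gaussian tails at the limits of a parametric range can be sketched as a score that is 1 inside the range and falls off smoothly outside it. The function name `range_score` and the use of a single width `sigma` for both tails are assumptions of this example.

```python
import math

def range_score(value, low, high, sigma):
    """Return 1.0 inside [low, high], and a Gaussian tail outside it.

    sigma sets how far the bounds may be stretched: one sigma beyond a limit
    scores about 0.61, two sigma about 0.14.
    """
    if low <= value <= high:
        return 1.0
    distance = (low - value) if value < low else (value - high)
    return math.exp(-0.5 * (distance / sigma) ** 2)

inside = range_score(5.0, 0.0, 10.0, sigma=1.0)
just_outside = range_score(11.0, 0.0, 10.0, sigma=1.0)
far_outside = range_score(15.0, 0.0, 10.0, sigma=1.0)
```

A parameter that lands slightly outside its range then lowers the match instead of vetoing it, so that strong evidence from location and context can still carry the recognition.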

The shape tables could also reference existing, complicated shapes which are defined by more primitive curves. Do autosophic concepts apply?

Clouds, with no well-defined edges, could present another challenge to the vision system.

Whoa! Back up. The most important part of the recognition process may be context-sensitivity or relationships among objects. We expect to find certain objects associated with other related objects. Outdoors, we expect to find buildings with shingled roofs. A shingle texture outdoors would be strongly suggestive of a house, a garage, or a shed. We wouldn't need to know what the shingle texture meant. All we would need to know would be that it was associated with a certain type of large object. The significance of the shingled roof as an identifier as opposed to, for instance, color or size could be derived from the fact that it occurred so commonly with buildings, whereas color and size varied from building to building. (The statistical significance of a shingled texture as an identifier could be developed by keeping track of the degree to which associations were invariant. This could perhaps be one method of calculating weights to be associated with relationships.) When I saw what appeared to be a lamp on a pole, I inferred the shape of the lamp from prior observations and I assumed that it was a lamp because it was in the right place and the right environment for a lamp. I didn't bother to check it more closely.
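Keeping track of how invariant an association is could be as simple as counting. In the sketch below, the weight of a feature is the fraction of sightings of a category in which that feature was present. The function name `association_weights` and the sample sightings are made up for illustration, and this is only one of several possible measures.

```python
from collections import Counter

def association_weights(observations, category):
    """Fraction of sightings of `category` in which each feature was present."""
    seen = [features for label, features in observations if label == category]
    counts = Counter(f for features in seen for f in set(features))
    return {f: n / len(seen) for f, n in counts.items()}

sightings = [
    ("building", {"shingle texture", "white", "large"}),
    ("building", {"shingle texture", "red", "large"}),
    ("building", {"shingle texture", "brown", "small"}),
]
weights = association_weights(sightings, "building")
```

In this toy data the shingle texture accompanies every building and so receives the full weight, while each color appears only once. Note that this measures how reliably a feature accompanies an object, not how rarely it appears elsewhere; a fuller treatment would need both.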

Instances of objects would be related on the basis of spatial proximity, temporal concurrency...

Along with the associated objects, we probably also want to store short, related, action sequences. With a generic door, we might want to store a sequence depicting the opening and closing of the door. (To pass a closed door, the robot grasps the door handle with its manipulator, turns it, and pushes or pulls on the door. That action sequence opens the door, without the robot's necessarily understanding what is going on.)

• Note that objects can have membership in more than one associative grouping. A computer keyboard might be associated with a generic computer, as well as with a certain room.

• If the object is easily recognized—brown wood finish, the shape of a violin, about the right size, has a stock, strings, a bridge, an opening—then the identification of "violin" will readily occur—the search can be a cursory one. If some parameters don't exactly match—the color is unusual or it's too large—then recognition may take a little longer. The vision system might reach out beyond the generic pattern of a violin to the specific instances of violins that originally defined the generic shape of a violin. Something like the above copper model or a violin-shaped marking would be so general, with only part of the shape matching that of a violin, that it would be brought to the level of awareness for conscious evaluation and decision. We might want to use a "recognition score" to decide whether recognition is to occur.
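A "recognition score" of the kind suggested above might be a weighted fraction of matched features, with one threshold for immediate recognition and a lower one for referring the case to conscious evaluation. The feature weights, the two thresholds, and the names `recognition_score` and `decide` are all assumptions chosen for the example.

```python
def recognition_score(matches, weights):
    """Weighted fraction of the expected features that were matched (0.0 to 1.0)."""
    total = sum(weights.values())
    matched = sum(w for name, w in weights.items() if matches.get(name, False))
    return matched / total

def decide(score, accept=0.8, refer=0.4):
    """Map a score onto the three outcomes described in the text."""
    if score >= accept:
        return "recognized"
    if score >= refer:
        return "refer to conscious evaluation"
    return "not recognized"

# Hypothetical weights for a violin, with shape counted most heavily.
violin = {"shape": 5, "size": 1, "color": 1, "strings": 1, "bridge": 1, "stock": 1}

ordinary = recognition_score({name: True for name in violin}, violin)
copper_model = recognition_score({"shape": True}, violin)
```

With these numbers an ordinary violin is recognized at once, while the copper model, matching on shape alone, scores 0.5 and is passed up for a decision.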

A violin shape is what's distinctive about a violin. Other stringed instruments have strings, bridges and stocks, but only violins and their kissing cousins, the viols, have a violin shape. There must be an important message here.

What are the distinguishing features? They're the features that are found only in conjunction with a certain generic object, like the features of a steam locomotive. Other vehicles don't have steam whistles or pistons or connecting rods or a ring of bolts around the front of the engine. Related objects are generic tenders, block signals, lighted switches, siding bumpers, train stations, baggage carts, coal tipples and water towers, and tracks and ties. Related events would be experiences with steam locomotives. The sounds of a steam engine would also be related objects/events. One way to do this might be to form local networks of related associations and then search along the dendrites of the network.

One wouldn't expect these categorizations to be categorical; rather, they would fade gradually into one another.

Setting the size ranges that distinguish between a violin, a viol, and a bass viol will be a matter of experience. Initially, anything that looked like a violin, waddled like a violin and quacked like a violin would be classified as a violin. This means that the shape table would have to be updated from information gained by the robot at the conscious level. Someone is going to have to explain to the robot that something that looks like a violin but is significantly bigger than a violin is a viol. And something bigger yet is a bass viol.

How do we recognize the man-in-the-moon?

• The object will be reshaped and, if necessary, re-sized on a snap-to-grid basis in order to carry out a comparison with the stored generic objects in the shape table.

• Shape will be given a higher priority in object recognition than other attributes such as size and color. If there is even a partial shape match of a distinctive silhouette in one of the dimensions, a new entry might be made in the shape, feature, and event table but there would be pointers between the original shape and its new derivative; e.g., the purple violin-shaped marking on the back of a brown recluse spider would be associated with the violin shape. This gets us to the question of how the robot's little mind decides what silhouettes are distinctive. The answer to that may lie in the fact that a distinctive silhouette, or a distinctive characteristic in general (for instance, a distinctive sound), is one that is not shared with any other generic object.

Problem: we recognize the sound box part of the violin shape on the back of the brown recluse without requiring a match of the whole violin silhouette—e.g., the stock.

We note that the stock is not distinctive when viewed from above, although it is when viewed from the side (with its curled end for tucking under the chin). Stocks are present on all stringed instruments. But how would we compare only the unique sound box and not the entire violin?

Answer: This might occur if we attempted to compress data autosophically—that is, if the computer, operating at night, were to play with the recordings of the day, dissecting them into sub-silhouettes, etc., and then trying to find similar curves in the archives.

Need to store with bilateral and other symmetries taken into account.

The brain operates with

3-D measurements:

Can detect 3-D perspectives by measuring z-axis distances. Using a known baseline (e.g., three inches), the baseline will subtend an angle of about 1° at a point 14 feet away. If the angle is measured with an error of ±1 arc-minute (1/60th degree), this corresponds to an uncertainty in distance of about ±3 inches. At one arc-minute of resolution, maximum range would be about 840 feet. Can be used to determine whether, for example, a trapezoid is really a 3-D rectangle. What the robot will do is move a few feet, maintaining edge registrations inch by inch as it moves, and then trigonometrically calculate the locations of various edge points.
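The ranging figures above follow from simple trigonometry, sketched here with a right-triangle geometry (the baseline perpendicular to the line of sight, which is an assumption). The function names are illustrative, and the distance error is a first-order estimate.

```python
import math

ARC_MINUTE = math.radians(1.0 / 60.0)

def parallax_angle_deg(baseline, distance):
    """Angle subtended by the baseline at the distant point, in degrees."""
    return math.degrees(math.atan2(baseline, distance))

def range_uncertainty(baseline, distance, angle_error=ARC_MINUTE):
    """First-order distance error caused by a given error in the measured angle."""
    angle = math.atan2(baseline, distance)
    return baseline / math.sin(angle) ** 2 * angle_error

def max_range(baseline, angle_error=ARC_MINUTE):
    """Distance at which the subtended angle shrinks to the angular resolution."""
    return baseline / math.tan(angle_error)

# All lengths in inches: a 3-inch baseline and a point 14 feet away.
angle = parallax_angle_deg(3.0, 14 * 12)
error_in = range_uncertainty(3.0, 14 * 12)
limit_ft = max_range(3.0) / 12
```

The three results land close to the rounded figures quoted above. Because the error grows with the square of the distance, doubling the range quadruples the uncertainty, which is why moving the robot a few feet to lengthen the baseline pays off.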

Might need stereo cameras to distinguish between object motion and camera motion.


May want to calculate both horizontal and vertical differences simultaneously. Will note intensity of differences. Will establish vertical line file and horizontal line file. Will check for continuity as the computer works its way across the image. Will try to fit curves to the lines. For images whose only purpose is navigation, will seek predicted locations of waypoints and then update (replace) the dead reckoning values with the observed values. For other objects in the scene, the computer will try to correlate the predicted locations of objects within its neighborhood with the actual live scene. It can then detect edges in the live scene to update its three-dimensional perspective model, perhaps calculating corner points to validate the perspective view which it is witnessing, but accepting.
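Calculating horizontal and vertical differences in one pass, and filing the strong ones separately, might look like the sketch below. It works on a small grayscale grid held as a list of rows. The function name `difference_edges`, the threshold, and the record layout `(row, column, intensity)` are assumptions for illustration.

```python
def difference_edges(image, threshold):
    """Return (vertical_line_file, horizontal_line_file) for a grayscale grid.

    A large difference between horizontal neighbors marks a point on a
    vertical line, and a large difference between vertical neighbors marks a
    point on a horizontal line. The intensity of each difference is kept.
    """
    vertical_file, horizontal_file = [], []
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                dx = abs(image[r][c + 1] - image[r][c])
                if dx >= threshold:
                    vertical_file.append((r, c, dx))
            if r + 1 < rows:
                dy = abs(image[r + 1][c] - image[r][c])
                if dy >= threshold:
                    horizontal_file.append((r, c, dy))
    return vertical_file, horizontal_file

image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
vertical_file, horizontal_file = difference_edges(image, threshold=50)
```

For this test image the vertical line file holds one point per row at the same column, which is the continuity the later curve-fitting step would look for, and the horizontal line file is empty.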

Order objects in the central FOV first.

Familiar scene

Unfamiliar scene

Moving, camera straight ahead, scenery streaming, turning, & enlarging.

Fixed, camera turning — translation of scene, new material

Moving, camera turning

Rotation will be 5° to 10° per frame and will be a system driver. (Lighting may change)

Don't forget imitation

How It Might Work:

Suppose we present the robot with a set of "identical" blocks. The first block the robot encounters will be associated with everything that is going on at the time. When it encounters the second block, the associated positions, simultaneous events, etc. won't be the same so it will eliminate them from the model of the object. However, the connections between the blocks and the room and also between the blocks and the animation track will remain. The block will be perceived as a part of the day's animation track. As additional experiences build up that have the block as the only common element, the block will be seen as a kind of time-invariant, geometrical entity with a generic existence. In that case, the robot would generate a generic block which would be a composite of the identical, individual blocks. However, it will still retain a "file" for each individual block. Now suppose that the robot encounters blocks that are not exactly like the identical blocks. In that case, the robot will broaden its definition of a block until the definition encompasses all similar blocks, but it will still retain its previously generated files.
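The elimination step described above, in which whatever differs between encounters drops out of the model of the object, can be sketched as an intersection of attribute records. The attribute names and values are invented for the example, and `generalize` is a hypothetical helper. Broadening the definition to cover similar but not identical blocks is not attempted here.

```python
def generalize(encounters):
    """Keep only the attribute values that every encounter has in common."""
    generic = dict(encounters[0])
    for encounter in encounters[1:]:
        generic = {k: v for k, v in generic.items() if encounter.get(k) == v}
    return generic

# Each encounter is the individual "file": the block plus whatever else was going on.
first = {"shape": "cube", "color": "red", "position": "table", "sound": "door slam"}
second = {"shape": "cube", "color": "red", "position": "floor", "sound": "radio"}

files = [first, second]            # the individual files are retained
generic_block = generalize(files)  # the composite, time-invariant entity
```

After two encounters the generic block keeps only shape and color, while position and the coincidental sounds remain in the individual files.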

Function will play an important role in differentiating among diverse objects.

Question: how does the robot decide when a block is still a block and when it becomes something else? How about this? When the robot encounters something else, like a candle, it will broaden its definition to include that, too, while still retaining its previous classifications. However, the silhouette for a candle should be sufficiently different from the silhouette for a block that some differentiation should occur based upon shape(?).

Furthermore, when the robot picks up a block, how does it separate the block from its manipulator?

Answer: At least at the outset, we'll make the manipulator transparent.

Emotions must play an important role. Some of these behaviors must be instinctive. For example, a dog will sniff out its world but a baby won't. Curiosity itself must be built-in. The tendency to grasp must be built in. How does the robot establish a connection between the way it moves and what it's trying to do? How does the robot know where and how to move its manipulator?

The robot will be rewarded when it infers a relationship. It will be rewarded again when it tests its relationships and they are verified. The reward state will wear off very soon but will return later, to lesser and lesser degrees. The robot will be rewarded for gaining power over its environment.

Suppose that we start with a single block. The robot will be programmed to pick up the block, examine it, integrate the new information with the existing information, and store it. The robot will be excited. It will abstract the shape of the cube, its weight, its rotational inertia, its color, its texture, its solidness, and whatever other observations/events occur at the same time. Now if a new identical block appears, it will pick this up and examine the block. The stored description would be identical. Would it re-examine the first block to take a better look and compare the second block with the first block? What principle would apply?

The concept of number might begin here.

Its exploratory strategy should be altered by its experiences.

In order to accomplish this, we'll need a hierarchy of strategies and of optimization criteria.

We can give it a built-in strategy that it can subsequently improve, on the basis of experience.

Now suppose that we presented it with a different-colored block that rattled when shaken. We would expect the robot to be fascinated by the rattling block. Then we would expect it to test other blocks to see if they also rattled.

We remember snapshots and video bites. (The U. N. building, the Indian museum in Destin.) If we want to store a shell model, then we need to store the viewing angle with it.

I remember that the gift shop at the U. N. building was downstairs and that the stairs were broad and shallow. I have snapshots of the gift shop.

We can loosely reconstruct sequences, so there is some timeline left, but details are degraded rapidly at first, and then more slowly with the passage of time.