Technologies that can recover information about hidden scenes have uses in search and rescue operations, law enforcement, fall detection for the elderly, and in helping self-driving cars locate hidden pedestrians. One such method, recently developed in a collaboration between NVIDIA, MIT, and the Technion – Israel Institute of Technology, can determine what unseen people are doing by capturing video of a nearby blank wall. Unlike previous solutions to this problem, the approach does not rely on known occluders or controllable light sources, which allows it to generalize to real-world scenarios without calibration. This is more than a nice-to-have feature: it is the difference between a technique that never leaves a research lab and one that has real utility in everyday situations.
This new method leverages the complex variations in indirect illumination that people cast onto the room over time. These variations can be incredibly small, imperceptible to both the human eye and traditional methods of image amplification.
Device setup (📷: P. Sharma et al.)
The signal-to-noise ratio is very low, so the first step in the pipeline subtracts unchanging elements out of the video stream. This yields a stronger signal representing anything in motion, which is passed on to the next step: classification. Two independent convolutional neural networks were constructed, one that determines the number of unseen people present in a room, and another that infers what activities those people are engaged in.
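The static-subtraction step can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: it assumes grayscale frames and simply removes the per-pixel temporal mean, then amplifies the faint residual left behind by motion.

```python
import numpy as np

def motion_signal(frames, amplification=50.0):
    """Isolate the moving component of a video of a blank wall.

    frames: array of shape (T, H, W) holding grayscale video frames.
    Subtracting the per-pixel temporal mean removes the static scene,
    leaving only the subtle illumination changes caused by anything
    in motion; the residual is amplified so it is easier to classify.
    """
    frames = frames.astype(np.float64)
    static = frames.mean(axis=0, keepdims=True)  # unchanging elements
    residual = frames - static                   # what remains is motion
    return amplification * residual

# Tiny demo: a constant wall plus one pixel with a faint variation.
video = np.full((8, 4, 4), 100.0)
video[:, 1, 1] += np.linspace(-1.0, 1.0, 8)  # imperceptibly small change
signal = motion_signal(video)
```

After subtraction, the constant background is exactly zero everywhere, while the varying pixel stands out by a factor of the amplification gain.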
The machine learning models were trained on datasets collected from twenty different scenes, each containing zero, one, or two people who were walking, jumping, waving their hands, crouching, or standing still. The models were then tested against five previously unseen scenes, achieving 94.4% accuracy in classifying the number of people present and 93.7% in determining what activities they were engaged in.
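The two-classifier setup described above might look something like the following PyTorch sketch. The paper's exact architectures are not given here, so this is a hypothetical small CNN used twice: once with three output classes for the person count (zero, one, or two) and once with five for the activities.

```python
import torch
import torch.nn as nn

class AmplifiedSignalClassifier(nn.Module):
    """Hypothetical small CNN operating on the amplified motion signal."""

    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims to 1x1
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# Two independent networks, one per task.
people_counter = AmplifiedSignalClassifier(num_classes=3)  # 0, 1, or 2 people
activity_net = AmplifiedSignalClassifier(num_classes=5)    # walk/jump/wave/crouch/still

# One 64x64 single-channel input frame produces one logit per class.
logits = activity_net(torch.randn(1, 1, 64, 64))
```

Keeping the tasks in separate networks, as the researchers did, lets each model specialize rather than sharing one feature extractor across both outputs.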
Running on a six-core laptop with an NVIDIA RTX 2080 Max-Q GPU, classification took place in near real time, at fifteen predictions per second, and no data needs to be transferred to a data center for processing. As such, this is a system that could theoretically be used in the field by emergency personnel, for example.
There are still some obstacles to overcome before the technique can be used under all circumstances, however. It is not very effective under low-light conditions, or when there are many irregular variations in lighting, such as when a television is on in the background. Issues also arise when there is a long distance between the subjects and the wall under observation. These issues notwithstanding, this is an important proof-of-concept work that, with just a bit more refinement, has the potential to make a meaningful impact in many everyday situations.