AI-driven technologies are instrumental in helping humans with mundane or dangerous tasks. To perform such tasks, robots must be able to interact with the environment through a strong perception system. Much like humans, intelligent agents can use visual and auditory perception to understand their environment. Despite the great progress made in visual systems, today's robots still have only basic auditory perception capabilities. This stems from the lack of a robust sound source localization system, which in turn results from insufficient knowledge of how sound waves propagate in cluttered and noisy real-life environments. Robots lacking effective auditory systems cannot handle unanticipated situations and are largely incapable of collaborating and interacting with humans. Auditory perception can outperform visual systems in dark or cluttered environments. In cluttered environments where a camera's field of view is blocked by physical structures, the sound emitted by non-line-of-sight (NLOS) targets can travel around physical obstacles through reflection and diffraction. This remarkable property of sound can help disaster robots locate NLOS victims and, most importantly, enable human-robot interaction across physical structures.

The first objective of this research is to establish approaches for modeling sound propagation in complex real-life environments and to realize effective and reliable sound source localization (SSL) techniques. In addition to SSL, an advanced auditory system should also be able to classify sound sources and extract individual sources from a mixed signal. Establishing effective sound source classification and separation is the second objective of this research. If successful, this research will significantly contribute to the advancement of reliable and robust auditory perception for AI-driven systems, from UxVs (e.g., disaster robots) to IoT devices.
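
To illustrate what a baseline SSL technique entails, the sketch below estimates the time difference of arrival (TDOA) between two microphones using the classical GCC-PHAT method. It is a minimal, illustrative example that assumes a simple two-microphone, free-field setup with white-noise excitation; the function name, parameters, and toy signals are hypothetical and do not represent the NLOS propagation models or localization techniques proposed in this research.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA (seconds) of `sig` relative to `ref` via GCC-PHAT.

    Illustrative sketch only: assumes a free-field, two-microphone setup.
    """
    n = sig.shape[0] + ref.shape[0]            # FFT length long enough for linear correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                     # PHAT weighting: keep only phase information
    cc = np.fft.irfft(R, n=n)                  # cross-correlation in the lag domain
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # lag (in samples) of the correlation peak
    return shift / float(fs)

# Toy usage: a broadband (white-noise) source delayed by 8 samples (0.5 ms at 16 kHz)
# at the second microphone; noise avoids the lag ambiguity of a pure tone.
fs = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(int(0.1 * fs))
delay_samples = 8
mic1 = src
mic2 = np.concatenate((np.zeros(delay_samples), src[:-delay_samples]))
print(gcc_phat_tdoa(mic2, mic1, fs))           # approximately +5.0e-4 s
```

The recovered TDOA, combined with the known microphone spacing and speed of sound, yields a direction-of-arrival estimate under free-field assumptions; the limitations of such classical estimators in reverberant, cluttered, and NLOS conditions motivate the propagation modeling proposed above.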