Document Type


Publication Date



This survey presents a review of state-of-the-art deep neural network architectures, algorithms, and systems in speech and vision applications. Recent advances in deep artificial neural network algorithms and architectures have spurred rapid innovation and development of intelligent speech and vision systems. With availability of vast amounts of sensor data and cloud computing for processing and training of deep neural networks, and with increased sophistication in mobile and embedded technology, the next-generation intelligent systems are poised to revolutionize personal and commercial computing. This survey begins by providing background and evolution of some of the most successful deep learning models for intelligent speech and vision systems to date. An overview of large-scale industrial research and development efforts is provided to emphasize future trends and prospects of intelligent speech and vision systems. Robust and efficient intelligent systems demand low-latency and high fidelity in resource-constrained hardware platforms such as mobile devices, robots, and automobiles. Therefore, this survey also provides a summary of key challenges and recent successes in running deep neural networks on hardware-restricted platforms, i.e. within limited memory, battery life, and processing capabilities. Finally, emerging applications of speech and vision across disciplines such as affective computing, intelligent transportation, and precision medicine are discussed. To our knowledge, this paper provides one of the most comprehensive surveys on the latest developments in intelligent speech and vision applications from the perspectives of both software and hardware systems. Many of these emerging technologies using deep neural networks show tremendous promise to revolutionize research and development for future speech and vision systems.