Stanford's CS231N, Deep Learning for Computer Vision, Spring 2025, begins with a profound exploration of how seeing shaped intelligence on Earth and how we are now teaching machines to see. This lecture, delivered by Professor Fei-Fei Li and Professor Ehsan Adeli, lays the groundwork for understanding one of AI's most transformative fields.
A key turning point in modern AI was the realization that data is a first-class citizen in machine learning. High-capacity models like neural networks require massive, curated datasets to generalize effectively, a lesson cemented by the success of the ImageNet challenge.
The quest to understand visual intelligence begins not with computers, but 540 million years ago during the Cambrian Explosion. Fossil evidence suggests this relatively short period saw an explosive diversification of animal species. A compelling theory posits that the development of photosensitive cells (the first primitive eyes) was a primary driver. This simple ability to collect light transformed life from passive metabolism to active engagement with the environment, fueling an evolutionary arms race that drove the development of nervous systems and intelligence.
Vision remains a primary sense for most animals. In humans, it is particularly dominant, with over half of our cortical cells dedicated to visual processing. This biological reality underscores why solving computer vision is fundamental to unlocking artificial intelligence.
The human ambition to build machines that see is centuries old, with thinkers like Leonardo da Vinci studying the camera obscura. The modern field of computer vision, however, traces its origins to the mid-20th century.
Seminal neuroscience experiments by Hubel and Wiesel in the 1950s revealed two critical principles of the mammalian visual cortex:
- Neurons respond selectively to simple stimuli, such as edges at particular orientations.
- Visual processing is hierarchical: simple cells feed into complex cells, building progressively more abstract representations.
This hierarchical model profoundly influenced later artificial neural network designs. The field's first PhD thesis, by Larry Roberts in 1963, focused on understanding 3D shapes from 2D images—a core, ill-posed problem of vision that nature solved with multiple eyes and brain processing.
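To make the biological analogy concrete, here is a minimal sketch, assuming PyTorch, of how stacked convolutional layers mirror this simple-to-complex hierarchy; the layer sizes are illustrative choices, not details from the lecture:

```python
import torch
import torch.nn as nn

# Illustrative hierarchy in the spirit of Hubel and Wiesel's findings:
# early layers act like "simple cells" (local oriented-edge detectors),
# while deeper layers combine their outputs into more complex patterns.
hierarchy = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local edge-like features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # spatial pooling, tolerant to small shifts
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations of edges: corners, textures
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level parts and shapes
    nn.ReLU(),
)

x = torch.randn(1, 3, 64, 64)  # a dummy 64x64 RGB image
features = hierarchy(x)
print(features.shape)          # torch.Size([1, 64, 16, 16])
```

Each successive layer sees a larger region of the input, so responses progress from local edges toward object-level structure, echoing the cortex's simple-to-complex organization.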
Despite early optimism, progress was slow. The field entered an "AI winter" as initial enthusiasm and funding dwindled due to the failure of early systems to deliver on ambitious promises. Research continued, but focused on narrower problems like edge detection and feature matching (e.g., SIFT features).
While computer vision progressed, a separate strand of research on artificial neural networks was evolving. Early work on perceptrons faced setbacks but led to key innovations:
- The perceptron (Rosenblatt, 1958), a single-layer machine that learns a linear decision boundary (sketched below).
- Backpropagation (popularized by Rumelhart, Hinton, and Williams in 1986), which made multi-layer networks trainable.
- Convolutional neural networks (LeCun's LeNet in the 1990s), applied successfully to handwritten digit recognition.
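As a concrete illustration of the first of these ideas, here is a toy NumPy sketch of the perceptron learning rule; the AND dataset, learning rate, and epoch count are hypothetical choices for this example, not details from the lecture:

```python
import numpy as np

# Rosenblatt's perceptron rule on a linearly separable toy problem (logical AND):
# predict with a step function, and update the weights only on mistakes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])  # AND labels

w = np.zeros(2)  # weights
b = 0.0          # bias
lr = 0.1         # learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        pred = int(w @ xi + b > 0)   # step activation
        w += lr * (yi - pred) * xi   # w <- w + lr * (y - pred) * x
        b += lr * (yi - pred)

print(w, b)                               # a separating line for AND
print([int(w @ xi + b > 0) for xi in X])  # [0, 0, 0, 1]
```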
Yet these early models stalled on more complex visual recognition tasks. A critical ingredient was missing: large-scale data.
The turning point came with the recognition that data is central to driving high-capacity models. This led to the creation of ImageNet, a massive dataset of over 14 million images across 22,000 categories, designed to reflect the number of object categories a human learns to recognize.
The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) became a benchmark for progress. For years, error rates remained high. Then, in 2012, a team led by Geoffrey Hinton entered a convolutional neural network model called AlexNet. It cut the top-5 error rate from roughly 26% to 16%, a stunning improvement that showcased the power of combining deep learning architectures with massive datasets and sufficient compute (GPUs).
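For reference, the AlexNet architecture survives today as a standard library component. A minimal sketch, assuming PyTorch and torchvision are installed (pretrained ImageNet weights are downloaded on first use):

```python
import torch
import torchvision

# Load AlexNet with its 2012-era architecture and ImageNet-trained weights.
weights = torchvision.models.AlexNet_Weights.IMAGENET1K_V1
model = torchvision.models.alexnet(weights=weights)
model.eval()

# The weights object bundles the matching preprocessing (resize, crop, normalize).
preprocess = weights.transforms()

x = torch.rand(3, 256, 256)  # stand-in for a real RGB photo
with torch.no_grad():
    logits = model(preprocess(x).unsqueeze(0))  # (1, 1000) class scores
print(logits.argmax(dim=1))  # predicted ImageNet class index (meaningless for random input)
```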
The 2012 ImageNet victory is widely considered the birth of the modern deep learning revolution, igniting an explosion of research and development in AI.
Since AlexNet, the field has rapidly advanced beyond simple image classification. The course will cover a wide spectrum of modern visual tasks and models, including object detection, segmentation, image captioning, and video understanding.
The field now encompasses generative AI (e.g., style transfer, DALL-E, diffusion models), vision-language models, 3D vision, and embodied AI for robotics. These advancements are powered by the converging forces of algorithms, data, and computation, with hardware like NVIDIA GPUs seeing exponential growth in FLOPs per dollar.
With great power comes great responsibility, and the lecture closed by highlighting critical open challenges for the field. Despite enormous progress, human vision remains remarkably nuanced, emotional, and complex, indicating that the journey to true visual intelligence is far from over.
Co-instructor Professor Ehsan Adeli outlined the course's structure, which is designed to provide a comprehensive foundation in deep learning for computer vision and is organized around four pillars.
The course will combine fundamental theory with practical assignments, including implementing a generative model for creating emojis from text prompts. The goal is to equip students to formalize vision tasks, develop and train models, and understand the field's trajectory.
Computer vision is more than just a subfield of AI; it is a cornerstone of intelligence itself. This course offers the tools to understand and contribute to one of the most dynamic and impactful areas of technological development today.