Researchers are exploring images and videos as training data for robots, targeting tasks that are hard to teach with existing approaches such as human-in-the-loop (HITL) supervision and large language models (LLMs). The idea is that visual data captures how a task actually unfolds, letting robots learn more complex actions and adapt better to new situations. Projects like Genima and Dreamitate report gains in task execution and generalization, pointing toward smarter, more versatile robotic systems.
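To make the core idea concrete, here is a minimal sketch of learning robot actions from video frames via plain behavior cloning. All names and shapes are hypothetical illustrations, not the methods used by Genima or Dreamitate (which build on generative image and video models rather than this simple supervised setup):

```python
# Hypothetical sketch: map demonstration video frames to robot actions.
# Assumes (frame, action) pairs extracted from recorded demonstrations.
import torch
import torch.nn as nn

class FramePolicy(nn.Module):
    """Maps a single RGB frame to a continuous action vector."""
    def __init__(self, action_dim: int = 7):  # e.g., a 7-DoF arm
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, H, W), pixel values normalized to [0, 1]
        return self.head(self.encoder(frames))

policy = FramePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Dummy stand-ins for a real demonstration dataset.
frames = torch.rand(16, 3, 96, 96)   # batch of video frames
actions = torch.rand(16, 7)          # recorded actions at those frames

for step in range(100):
    pred = policy(frames)
    loss = loss_fn(pred, actions)    # imitate the demonstrated action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The appeal of the visual approach is that the supervision signal comes directly from watching the task being performed, rather than from hand-written reward functions or text instructions alone.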