Xiaomi Robotics 0 VLA Model launches for advanced open source robotics and physical intelligence development

Xiaomi launches Xiaomi Robotics 0, an open source 4.7B parameter VLA model that combines visual perception and language understanding for advanced physical intelligence.

Xiaomi Unveils Xiaomi Robotics 0 VLA Model for Advanced Robotics

Xiaomi has launched its first Vision Language Action (VLA) model, Xiaomi Robotics 0. The open source model has 4.7 billion parameters and is aimed at developing physical intelligence. It combines visual perception, language comprehension, and real time physical execution in one unified system.

The Xiaomi Robotics 0 model operates as a closed loop of perception, decision, and execution. It is divided into two main functional components:

  • Visual Language Model (VLM): The cognitive center of the system. It interprets high resolution scenes and analyzes commands, and it can handle vague instructions by identifying the relevant objects and making logical deductions.
  • Action Expert: This block uses a multi level Diffusion Transformer to control all physical movements. It generates complete action sequences, which keeps motion continuous and accurate.
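The division of labor between the two components can be sketched in a few lines of Python. This is an illustrative toy only, not Xiaomi's code: the names ToyVLM, ToyActionExpert, and control_step, the two-number feature vector, and the linear ramp that stands in for diffusion sampling are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    image: Sequence[float]  # stand-in for camera pixels
    instruction: str        # natural language command


class ToyVLM:
    """Hypothetical stand-in for the vision language model."""

    def encode(self, obs: Observation) -> List[float]:
        # Collapse the scene and the command into a tiny feature vector.
        mean_pixel = sum(obs.image) / len(obs.image)
        word_count = float(len(obs.instruction.split()))
        return [mean_pixel, word_count]


class ToyActionExpert:
    """Hypothetical stand-in for the Diffusion Transformer action head."""

    def __init__(self, chunk_len: int = 4) -> None:
        self.chunk_len = chunk_len

    def generate(self, features: List[float]) -> List[float]:
        # Emit a whole sequence of actions at once rather than a single step.
        # A real model would sample this; here it is a ramp toward a target.
        target = features[0]
        return [target * (i + 1) / self.chunk_len for i in range(self.chunk_len)]


def control_step(vlm: ToyVLM, expert: ToyActionExpert, obs: Observation) -> List[float]:
    features = vlm.encode(obs)        # perception and decision
    return expert.generate(features)  # plan handed to execution


obs = Observation(image=[0.0, 1.0], instruction="pick up the cup")
chunk = control_step(ToyVLM(), ToyActionExpert(chunk_len=4), obs)
print(chunk)
```

The point of the sketch is the interface: the language side produces features once per observation, and the action side turns them into a multi step sequence that the robot can execute.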

Training proceeded in phases to connect scene comprehension with motor capabilities. The VLM starts by learning to predict action distributions from visual information. The Diffusion Transformer learns to generate movement sequences by attending to key environmental features rather than plain text tokens. This approach lets the robot execute physical tasks while retaining its cognitive abilities.

Xiaomi introduced several technical improvements to reduce the inference delays that caused pauses during operation.

Asynchronous Computing: The robot keeps executing movements while the model is still running, so complex calculations that need extra processing time do not interrupt motion.
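To see why overlapping inference with execution helps, consider a small tick based simulation. It is a sketch under stated assumptions, not a description of Xiaomi's scheduler: the function name count_stalls, the fixed latency measured in control ticks, and the rule of requesting a new chunk when the buffer holds no more than `latency` actions are all hypothetical choices made for illustration.

```python
from collections import deque


def count_stalls(total_ticks: int, chunk_len: int, latency: int, asynchronous: bool) -> int:
    """Count control ticks in which the robot has no action to execute."""
    buffer = deque()  # actions waiting to be executed
    pending = None    # ticks left until the in-flight chunk arrives
    stalls = 0
    for _ in range(total_ticks):
        # Deliver a chunk whose inference has just finished.
        if pending is not None:
            pending -= 1
            if pending == 0:
                buffer.extend(range(chunk_len))
                pending = None
        # Start a new inference. The synchronous policy waits for an empty
        # buffer; the asynchronous one starts early enough to hide the latency.
        if pending is None:
            threshold = latency if asynchronous else 0
            if len(buffer) <= threshold:
                pending = latency
        # Execute one action per tick, or stall if none is available.
        if buffer:
            buffer.popleft()
        else:
            stalls += 1
    return stalls


sync_stalls = count_stalls(total_ticks=40, chunk_len=8, latency=3, asynchronous=False)
async_stalls = count_stalls(total_ticks=40, chunk_len=8, latency=3, asynchronous=True)
print(sync_stalls, async_stalls)  # prints "12 3"
```

In this toy setup the synchronous robot pauses for three ticks before every chunk, while the asynchronous one only pauses during the initial warm up. The simulation ignores one real cost of the asynchronous approach: a chunk requested early is computed from a slightly older observation.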

Clean Action Prefix Technology: Previously generated actions are fed back into the model so the robot can refine its physical trajectory.
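The idea of feeding earlier actions back into the generator can be illustrated with a one dimensional toy. This is a hedged sketch, not the actual mechanism: generate_chunk and boundary_jump are hypothetical helpers, and a rate limited ramp toward a target stands in for the learned model. The only thing it demonstrates is that conditioning on the last executed action lets a new chunk start where the previous one ended.

```python
from typing import List, Optional


def generate_chunk(target: float, chunk_len: int,
                   prefix: Optional[List[float]] = None,
                   max_step: float = 0.1) -> List[float]:
    """Generate actions that move toward target by at most max_step per step."""
    # With a prefix, continue from the last executed action.
    # Without one, the generator falls back to a default start of 0.0.
    current = prefix[-1] if prefix else 0.0
    chunk = []
    for _ in range(chunk_len):
        delta = max(-max_step, min(max_step, target - current))
        current += delta
        chunk.append(round(current, 6))
    return chunk


def boundary_jump(executed: List[float], chunk: List[float]) -> float:
    """Size of the discontinuity between the old and the new chunk."""
    return abs(chunk[0] - executed[-1])


executed = [0.5, 0.6, 0.7]  # actions the robot has already carried out
with_prefix = generate_chunk(target=1.0, chunk_len=4, prefix=executed)
without_prefix = generate_chunk(target=1.0, chunk_len=4)

print(boundary_jump(executed, with_prefix))     # about 0.1, a normal step
print(boundary_jump(executed, without_prefix))  # about 0.6, a visible jerk
```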

Attention Masking: The model's attention is directed toward the most recent visual inputs, which lets it respond faster to changes in its surroundings.

Xiaomi Robotics 0 outperformed 30 competing systems on benchmarks including LIBERO, CALVIN, and SimplerEnv. Real world testing used a dual arm robotic platform. The robot completed complex, multi stage tasks such as folding towels and disassembling construction sets, demonstrating that it can handle both rigid and flexible objects.

About the author

mgtid
Owner of Technetbook | 10+ Years of Expertise in Technology | Seasoned Writer, Designer, and Programmer | Specialist in In-Depth Tech Reviews and Industry Insights | Passionate about Driving Innovation and Educating the Tech Community
