








The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model built purely on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models.
Qualitative results of the visual generation and understanding capabilities of FUDOKI. FUDOKI is built on the discrete flow matching framework for both visual and textual modalities, and performs understanding and generation simultaneously under one unified paradigm.
Generation process of different methods:
(a) AR-based Janus: Generates tokens strictly sequentially; an error made in an early step propagates through all subsequent outputs.
(b) D-DiT (mask-based discrete diffusion, MDD): Cannot revise a token once it is unmasked, so errors are irreversible, leading to poor generalization.
(c) FUDOKI (discrete flow matching, DFM): Allows generated tokens to be revised in later steps, enabling step-by-step reasoning and error correction for more accurate answers.
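The contrast between (b) and (c) can be sketched with a toy sampler. This is an illustrative sketch only, not the actual FUDOKI sampler: the vocabulary, jump probabilities, and linear time schedule are all invented assumptions. The key structural difference it demonstrates is real, though: a mask-based process freezes each token the moment it is revealed, while a DFM-style process lets every position jump to a new state at every step, with the transition distribution concentrating on the data as time t approaches 1.

```python
import random

VOCAB = ["cat", "dog", "sun", "sky"]  # toy vocabulary used as the noise source

def mask_based_decode(target, steps, rng):
    # Mask-based discrete diffusion sketch: every position starts as [MASK]
    # (None here); once a token is revealed it is frozen forever.
    seq = [None] * len(target)
    for _ in range(steps):
        for i in range(len(seq)):
            if seq[i] is None and rng.random() < 0.5:
                # An imperfect model sometimes commits a wrong token for good.
                seq[i] = target[i] if rng.random() < 0.8 else "ERR"
    for i in range(len(seq)):
        if seq[i] is None:  # fill any leftover masks, also irrevocably
            seq[i] = target[i] if rng.random() < 0.8 else "ERR"
    return seq

def dfm_decode(target, steps, rng):
    # Discrete-flow-matching sketch: positions start from noise and may jump
    # to a new state at *every* step; the jump target sharpens toward the
    # data as time t runs from 0 to 1, so early mistakes can be corrected.
    seq = [rng.choice(VOCAB) for _ in target]
    revisions = 0
    for k in range(1, steps + 1):
        t = k / steps  # time schedule: t -> 1 means fully denoised
        for i in range(len(seq)):
            new = target[i] if rng.random() < t else rng.choice(VOCAB)
            if new != seq[i]:
                revisions += 1  # a previously generated token was changed
            seq[i] = new
    return seq, revisions

if __name__ == "__main__":
    target = ["a", "red", "fox"]
    print("mask-based:", mask_based_decode(target, 10, random.Random(0)))
    out, rev = dfm_decode(target, 10, random.Random(0))
    print("dfm:", out, "revisions:", rev)
```

In the mask-based sketch, any `"ERR"` token committed early survives to the final output; in the DFM sketch, every position is re-sampled each step, so the sequence converges to the target despite wrong intermediate tokens. This mirrors why the iterative-refinement property of discrete flow matching enables the self-correction behavior described above.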