








The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model built purely on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models.
Qualitative results of the visual generation and understanding capabilities of FUDOKI. FUDOKI is built on the discrete flow matching framework for both visual and textual modalities, and performs understanding and generation simultaneously under one unified paradigm.
Generation process of different methods:
(a) AR-based Janus: Generates tokens strictly sequentially; an error made in an early step propagates through all subsequent outputs.
(b) D-DiT (mask-based discrete diffusion, MDD): Cannot revise a token once it is unmasked, so errors are irreversible, leading to poor generalization.
(c) FUDOKI (discrete flow matching, DFM): Allows generated tokens to be revised in later steps, enabling step-by-step reasoning and error correction for more accurate answers.
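The contrast between (b) and (c) can be sketched with a toy sampler. This is an illustrative sketch only, not the actual FUDOKI sampler: the vocabulary, jump probabilities, and linear time schedule are all invented assumptions. The key structural difference it demonstrates is real, though: a mask-based process freezes each token the moment it is revealed, while a DFM-style process lets every position jump to a new state at every step, with the transition distribution concentrating on the data as time t approaches 1.

```python
import random

VOCAB = ["cat", "dog", "sun", "sky"]  # toy vocabulary used as the noise source

def mask_based_decode(target, steps, rng):
    # Mask-based discrete diffusion sketch: every position starts as [MASK]
    # (None here); once a token is revealed it is frozen forever.
    seq = [None] * len(target)
    for _ in range(steps):
        for i in range(len(seq)):
            if seq[i] is None and rng.random() < 0.5:
                # An imperfect model sometimes commits a wrong token for good.
                seq[i] = target[i] if rng.random() < 0.8 else "ERR"
    for i in range(len(seq)):
        if seq[i] is None:  # fill any leftover masks, also irrevocably
            seq[i] = target[i] if rng.random() < 0.8 else "ERR"
    return seq

def dfm_decode(target, steps, rng):
    # Discrete-flow-matching sketch: positions start from noise and may jump
    # to a new state at *every* step; the jump target sharpens toward the
    # data as time t runs from 0 to 1, so early mistakes can be corrected.
    seq = [rng.choice(VOCAB) for _ in target]
    revisions = 0
    for k in range(1, steps + 1):
        t = k / steps  # time schedule: t -> 1 means fully denoised
        for i in range(len(seq)):
            new = target[i] if rng.random() < t else rng.choice(VOCAB)
            if new != seq[i]:
                revisions += 1  # a previously generated token was changed
            seq[i] = new
    return seq, revisions

if __name__ == "__main__":
    target = ["a", "red", "fox"]
    print("mask-based:", mask_based_decode(target, 10, random.Random(0)))
    out, rev = dfm_decode(target, 10, random.Random(0))
    print("dfm:", out, "revisions:", rev)
```

In the mask-based sketch, any `"ERR"` token committed early survives to the final output; in the DFM sketch, every position is re-sampled each step, so the sequence converges to the target despite wrong intermediate tokens. This mirrors why the iterative-refinement property of discrete flow matching enables the self-correction behavior described above.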