Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

arXiv 2026
Zengqun Zhao¹ · Yanzuo Lu² · Ziquan Liu¹ · Jifei Song³ · Jiankang Deng² · Ioannis Patras¹
¹Queen Mary University of London · ²Imperial College London · ³Huawei R&D UK

TL;DR

Relax Forcing improves long-horizon autoregressive video generation by replacing dense full-history attention with a structured memory design: Sink for stability, Tail for short-term continuity, and selected History for motion guidance. This yields better temporal dynamics and consistency on VBench-Long while reducing attention overhead and improving scalability.

Abstract

Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training–inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles (Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance) and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.

Method

Method Overview
Figure 1. Overview of Relaxed KV Memory. Instead of retaining dense chronological history, temporal memory is decomposed into three functional components: Sink for global anchors, History for intermediate motion structure, and Tail for recent continuity. During generation, candidate historical frames are dynamically selected to remain aligned with Sink while avoiding redundancy with Tail. The selected memory is then integrated through a relaxed KV formulation with adjusted relative positional encoding, enabling the model to leverage non-contiguous temporal context while preserving long-range consistency during autoregressive rollout.
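
The selection step described in the caption can be illustrated with a minimal Python sketch. It is not the released implementation. The scoring rule (cosine similarity to the Sink reference minus cosine similarity to the Tail reference), the per-frame feature vectors, the default sizes, and the names select_memory and relaxed_positions are all assumptions made for illustration. The page states only that selected History should stay aligned with Sink and avoid redundancy with Tail, and that relative positions are adjusted. Re-indexing the kept frames contiguously is one plausible reading of that adjustment.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_vec(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_memory(frame_feats: Sequence[Sequence[float]],
                  n_sink: int = 1, n_tail: int = 2,
                  n_history: int = 2) -> List[int]:
    """Return the sorted frame indices kept as Sink + selected History + Tail.

    Assumed scoring rule (hypothetical): prefer candidates that are similar
    to the Sink frames and dissimilar to the Tail frames.
    """
    if n_sink < 1 or n_tail < 1:
        raise ValueError("this sketch needs at least one Sink and one Tail frame")
    t = len(frame_feats)
    if t <= n_sink + n_tail:
        return list(range(t))  # nothing to select yet: keep every frame
    sink = list(range(n_sink))
    tail = list(range(t - n_tail, t))
    candidates = range(n_sink, t - n_tail)
    sink_ref = mean_vec([frame_feats[i] for i in sink])
    tail_ref = mean_vec([frame_feats[i] for i in tail])

    def score(i: int) -> float:
        return cosine(frame_feats[i], sink_ref) - cosine(frame_feats[i], tail_ref)

    chosen = sorted(candidates, key=score, reverse=True)[:n_history]
    return sink + sorted(chosen) + tail


def relaxed_positions(kept: Sequence[int]) -> List[int]:
    """Assumed position adjustment: re-index kept frames contiguously, in order."""
    return list(range(len(kept)))


# Toy example: six frames with 2-D features; frame 2 looks like the Tail frames.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [0.0, 1.0], [0.0, 1.0]]
kept = select_memory(feats)          # [0, 1, 3, 4, 5]
positions = relaxed_positions(kept)  # [0, 1, 2, 3, 4]
```

In this toy example, frame 2 is dropped because it is nearly identical to the Tail frames. Frames 1 and 3 are kept as History, and the five retained frames receive contiguous positions even though their original indices are not contiguous.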

Qualitative Video Comparison

Comparison with Baseline

An astronaut runs smoothly and appears almost weightless on the lunar surface, as seen from a low-angle shot that highlights the vast, desolate background of the moon. The moon's craters and rocky terrain are clearly visible, creating a stark contrast against the running astronaut who moves with graceful, fluid motions. The background features a muted, grayscale texture with subtle shadows and highlights, emphasizing the lunar landscape's rugged beauty. The astronaut wears a classic spacesuit with reflective fabric, adding to the sense of lightness and movement. A dynamic medium shot capturing the astronaut's forward momentum.
Baseline
+ Relax Forcing
An aerial shot of the ocean, capturing a mesmerizing maelstrom forming in the water, swirling violently before revealing the fiery depths below. The water churns with intense energy, creating a whirlpool effect that stretches from the surface to the murky depths. The swirling currents illuminate the underwater landscape, showcasing a vivid array of colors and textures, as if the ocean floor is alight with hidden fires. The camera angle provides a dramatic overhead view, emphasizing the dynamic motion and the vastness of the ocean.
Baseline
+ Relax Forcing
An astronaut running through a narrow alley in Rio de Janeiro, Brazil. The astronaut is dressed in a bright white spacesuit with a helmet that reflects sunlight. The spacesuit is adorned with various technical patches and has a reflective texture. The astronaut's movements are energetic and dynamic, with one hand on their hip and the other reaching forward for balance. The background features colorful street art, vibrant buildings, and people bustling about. The alley is dimly lit, with shadows cast by the narrow walls. A mid-shot with the astronaut running from a low-angle perspective, capturing the excitement and contrast between the urban environment and the space exploration gear.
Baseline
+ Relax Forcing
A winter scene in a snowy forest, where a litter of playful golden retriever puppies emerge from the snow. Their heads pop out, their fluffy fur glistening in the sunlight, and they wag their tails joyfully. They are covered in snow, with some paw prints leading away into the deep snow. One puppy is burying its nose in the snow, while another chases a small ball that has rolled nearby. The background shows dense evergreen trees and a gentle slope leading up to a clearing. The air is crisp and cold, with tiny snowflakes falling gently. A close-up shot from a slightly elevated angle, capturing the lively and energetic moment.
Baseline
+ Relax Forcing
An ultra-fast disorienting hyperlapse photograph capturing a car racing through a tunnel, transitioning into a chaotic labyrinth of rapidly growing vines. The car's headlights illuminate the tunnel walls, which are adorned with peeling paint and graffiti. As the tunnel ends, the camera speeds into a dense forest of vines, their leaves and tendrils swaying wildly. The vines grow at an alarming rate, forming a maze-like structure that twists and turns. The car appears to be navigating this treacherous path, with the driver focused intently on the winding route. The background is filled with blurred, green foliage and twisted branches, creating a sense of urgency and chaos. The photo has a gritty, hyperrealistic texture, emphasizing the dynamic movement and intense visual effects. A wide-angle shot from a low angle, capturing the car's rapid descent into the vine-laden labyrinth.
Baseline
+ Relax Forcing

Comparison with SOTA Methods

A stylish woman strolls down a bustling Tokyo street, the warm glow of neon lights and animated city signs casting vibrant reflections. She wears a sleek black leather jacket paired with a flowing red dress and black boots, her black purse slung over her shoulder. Sunglasses perched on her nose and a bold red lipstick add to her confident, casual demeanor. The street is damp and reflective, creating a mirror-like effect that enhances the colorful lights and shadows. Pedestrians move about, adding to the lively atmosphere. The scene is captured in a dynamic medium shot with the woman walking slightly to one side, highlighting her graceful strides.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A stunning mid-afternoon landscape photograph with a low camera angle, showcasing several giant wooly mammoths treading through a snowy meadow. Their long, wooly fur gently billows in the brisk wind as they move, creating a sense of natural movement. Snow-covered trees and dramatic snow-capped mountains loom in the distance, adding to the majestic setting. Wispy clouds and a high sun cast a warm glow over the scene, enhancing the serene and awe-inspiring atmosphere. The depth of field brings out the detailed textures of the mammoths and the snowy environment, capturing every nuance of these prehistoric giants in breathtaking clarity.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A drone view of waves crashing against the rugged cliffs along Big Sur's Garay Point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore, casting long shadows. In the distance, a small island with a lighthouse stands tall, its beam piercing the twilight. Green shrubbery covers the cliff's edge, and the steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. The camera angle provides a bird's-eye view, capturing the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway. The scene is bathed in a warm, golden hue, highlighting the textures and details of the rocky terrain.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A dynamic and vibrant anime illustration in a flowing watercolor style, capturing the bustling snowy streets of Tokyo. The camera moves smoothly through the city, following several people joyfully enjoying the snow and shopping at nearby stalls. Gorgeous sakura petals dance through the air, swirling with snowflakes. The scene features traditional Japanese architecture, with shops and lanterns illuminated by the soft winter light. People are bundled up in warm coats and scarves, their faces lit with smiles. The background shows blurred, snowy rooftops and distant cherry blossom trees, creating a serene yet lively atmosphere. A medium shot with a sweeping camera motion, highlighting the natural movement of both people and petals.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A charming comic-style illustration depicting a cozy living room scene where a fluffy gray cat is waking up its sleeping owner, who lies on the couch with a sleepy, resigned expression. The cat, with large, round eyes and a mischievous look, is pawing at the owner's face and meowing insistently. The owner attempts to ignore the cat, turning away slightly, but the cat persists, jumping onto the owner's chest and nuzzling their hand. Finally, the owner, unable to resist, reaches under the pillow and pulls out a small bag of treats, offering it to the cat with a playful smile. The background shows soft, warm lighting from a nearby lamp, with scattered books and a blanket on the couch. A medium shot from a slightly elevated angle, capturing both the cat and the owner's interaction.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A cyberpunk-style illustration depicting a lone robot navigating a neon-lit cityscape. The robot stands tall with sleek, metallic armor, adorned with blinking lights and wires. Its eyes, glowing with a deep blue hue, scan the surroundings with curiosity. The background features towering skyscrapers, holographic advertisements, and crowded streets filled with various cyborgs and humans. The air is thick with smoke and the hum of technology. A medium shot from a high-angle perspective, capturing both the robot and the bustling city environment.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A macro shot of a volcanic eruption in a coffee cup, capturing the dramatic moment in vivid detail. The coffee cup is filled with rich, dark brown liquid, and the surface is suddenly disrupted by a burst of foam and steam, mimicking the intense heat and pressure of a real volcanic eruption. The foam rises and spreads across the surface, creating a chaotic yet mesmerizing pattern. The cup itself is made of ceramic, with intricate patterns etched into the sides, adding texture and depth to the scene. The background is a blurred gradient of warm browns and grays, enhancing the focus on the erupting foam. The lighting is dramatic, casting shadows and highlighting the dynamic movement of the foam. A close-up shot from a low angle, emphasizing the explosive nature of the eruption.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A nighttime scene from a vintage film-style photograph, depicting a giant, otherworldly creature slowly walking down a desolate, rundown city street. Only one dim streetlamp casts flickering shadows, illuminating the creature's massive, imposing form. Its skin is rough and covered in peculiar growths, with glowing eyes that reflect the dim light. The creature's steps echo in the empty alleyways, creating a sense of eerie quiet. The background features crumbling buildings, broken windows, and trash-strewn sidewalks. The photo has a grainy texture and a muted color palette, capturing the haunting atmosphere of the scene. A medium shot with a slight tilt to the camera, emphasizing the creature's movement and presence.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)
A cinematic scene from a classic western movie, featuring a rugged man riding a powerful horse through the vast Gobi Desert at sunset. The man, dressed in a dusty cowboy hat and a worn leather jacket, reins tightly on the horse's neck as he gallops across the golden sands. The sun sets dramatically behind them, casting long shadows and warm hues across the landscape. The background is filled with rolling dunes and sparse, rocky outcrops, emphasizing the harsh beauty of the desert. A dynamic wide shot from a low angle, capturing both the man and the expansive desert vista.
Self-Forcing
Attention Sink
Rolling-Forcing
Relax Forcing (Ours)

Ablation Studies

Temporal memory roles during long-horizon rollout.
No KV Memory
Only Sink
Only History
Only Tail
Complementary roles of temporal memory components during long-horizon rollout.
No Sink
No History
No Tail
Full
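
The eight ablation variants above differ only in which memory components are retained. The short Python sketch below spells out that mapping. The variant names come from the labels above, while the lowercase component keys, the example frame indices, and the helper apply_ablation are hypothetical and used only for illustration.

```python
from typing import Dict, FrozenSet, List

COMPONENTS = ("sink", "history", "tail")

# Which memory components each ablation variant keeps.
ABLATIONS: Dict[str, FrozenSet[str]] = {
    "No KV Memory": frozenset(),
    "Only Sink": frozenset({"sink"}),
    "Only History": frozenset({"history"}),
    "Only Tail": frozenset({"tail"}),
    "No Sink": frozenset({"history", "tail"}),
    "No History": frozenset({"sink", "tail"}),
    "No Tail": frozenset({"sink", "history"}),
    "Full": frozenset(COMPONENTS),
}


def apply_ablation(memory: Dict[str, List[int]], variant: str) -> List[int]:
    """Return the sorted frame indices that remain visible under a variant."""
    enabled = ABLATIONS[variant]
    kept: List[int] = []
    for name in COMPONENTS:
        if name in enabled:
            kept.extend(memory[name])
    return sorted(kept)


# Hypothetical frame indices assigned to each memory role.
example_memory = {"sink": [0], "history": [1, 3], "tail": [4, 5]}
no_history = apply_ablation(example_memory, "No History")  # [0, 4, 5]
```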

Quantitative Results

Quantitative Results Comparison