With recent advancements in image generation quality, there is a growing concern around safety, privacy, and copyrighted content in diffusion-model-generated images. Recent works attempt to restrict undesired content via inference-time interventions or post-generation filtering, but such methods often fail when users have direct access to model weights or can bypass constraints through adversarial prompts.
In this work, we propose the Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in autoregressive (AR) models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives, and a Thresholded Loss Masking (TLM) strategy to protect content unrelated to the target concept during fine-tuning. Furthermore, we propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), aimed at providing a more rigorous and comprehensive foundation for evaluating concept erasure in AR models. To build it, we first employ structured templates across diverse large language models (LLMs) to pre-generate a large-scale corpus of target-replacement concept prompt pairs. We then generate images from these prompts and subject them to rigorous filtering with a visual classifier to ensure concept fidelity and alignment. Extensive experiments on the ECGVF benchmark with the AR model Janus-Pro demonstrate that EAR achieves marked improvements in both erasure effectiveness and model utility preservation.
Since large-scale autoregressive models such as Janus-Pro are trained to predict and reconstruct data sequences from vast datasets, they inherently learn to generate sensitive, copyrighted, or harmful content, from nudity and artistic styles to trademarked symbols. These capabilities pose escalating risks: their autoregressive nature enables step-by-step synthesis of deepfake pornography, exacerbating consent violations; their iterative patch-based generation can precisely replicate artistic styles, threatening creators' livelihoods; and their latent-space memorization regurgitates copyrighted content without attribution. For institutions deploying such models, these risks carry legal, ethical, and financial consequences.
Existing inference-time filters and post-hoc checks fail in autoregressive settings: users can bypass them by exploiting incremental generation or adversarial prompts. Our approach addresses this by editing the model's autoregressive weight dynamics: we permanently erase concepts by realigning their latent patches toward neutral substitutes during each generation step. This disrupts the model's ability to reconstruct undesired sequences while preserving fidelity for other content, all via lightweight fine-tuning.
We leverage the model's own autoregressive knowledge to selectively unlearn concepts. Rather than gathering external datasets or manually filtering training data, we utilize the model's existing patch-based representations to guide concept removal.
The approach is elegant yet effective: the pretrained autoregressive model Pθ*(x) already encodes rich hierarchical relationships between concepts c, so we transform it into Pθ(x) by redirecting target-concept patches toward semantically similar but permissible alternatives.
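One plausible instantiation of this redirection, given here only as an illustrative sketch rather than the exact training objective, writes x = (x_1, ..., x_N) for the patch-token sequence, c for the target concept, and c' for a permissible substitute, and trains Pθ so that its per-step conditionals under c track the frozen model Pθ*'s conditionals under c':

\[
\mathcal{L}(\theta) \;=\; \sum_{i=1}^{N} \mathrm{KL}\!\Big( P_{\theta^{*}}\big(x_i \mid x_{<i},\, c'\big) \;\Big\|\; P_{\theta}\big(x_i \mid x_{<i},\, c\big) \Big),
\]

so that prompting the fine-tuned model with the erased concept reproduces the statistics of the substitute concept rather than the original one.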
This builds upon principles from compositional energy-based models, but implements them through patch realignment in the autoregressive framework. Instead of subtracting concept-conditioned components, we progressively nudge the model's attention mechanisms away from undesirable concept patches during generation.
The network architecture of EAR. EAR introduces Windowed Gradient Accumulation (WGA) to prevent single-token perturbations from disrupting sequence dependencies, aligning updates with the autoregressive generation process of AR models. It also introduces Thresholded Loss Masking (TLM) to block updates in regions unrelated to the target concept, precisely isolating the target semantic areas.
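To make the interaction between the two strategies concrete, the following is a minimal PyTorch-style sketch of one fine-tuning step; the window size, loss threshold, model interface, and the use of a cross-entropy-to-replacement loss are illustrative assumptions rather than the released implementation.

import torch

# Illustrative hyperparameters (assumed values, not taken from the paper).
WINDOW_SIZE = 16      # consecutive patch tokens per gradient-accumulation window (WGA)
LOSS_THRESHOLD = 0.1  # per-token losses below this are masked out (TLM)

def erasure_step(model, optimizer, tokens, replacement_logits):
    """One sketched EAR-style update on a batch of patch-token sequences.

    tokens:             (B, N) patch-token ids decoded under the target-concept prompt
    replacement_logits: (B, N, V) frozen-model logits under the replacement prompt
    """
    optimizer.zero_grad()
    logits = model(tokens)  # (B, N, V); assumed AR-model interface

    # Per-token erasure loss: pull the tuned conditionals toward the replacement concept.
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_probs = torch.softmax(replacement_logits.detach(), dim=-1)
    per_token_loss = -(ref_probs * log_probs).sum(dim=-1)  # (B, N) cross-entropy

    # Thresholded Loss Masking (TLM): tokens whose loss is already small are treated
    # as unrelated to the target concept and contribute no gradient.
    mask = (per_token_loss > LOSS_THRESHOLD).float()
    masked_loss = per_token_loss * mask

    # Windowed Gradient Accumulation (WGA): accumulate gradients window by window
    # instead of perturbing single tokens, so sequence dependencies stay intact.
    num_tokens = masked_loss.size(1)
    for start in range(0, num_tokens, WINDOW_SIZE):
        window_loss = masked_loss[:, start:start + WINDOW_SIZE].mean()
        window_loss.backward(retain_graph=True)  # gradients add up across windows

    optimizer.step()

In this sketch TLM is applied before WGA, so only tokens flagged as concept-relevant contribute to each window's accumulated gradient; whether the two strategies are coupled in exactly this order is an assumption of the sketch.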
We present a systematic approach for concept erasure in autoregressive models; our method decomposes the process into three distinct phases that progressively modify the model's internal representations. Through quantitative experiments, we compare our staged approach with single-step erasure methods including direct fine-tuning and gradient reversal techniques. We also measure the preservation of non-target concepts during each phase of the erasure process.
Since NSFW content can be generated through implicit contextual cues, we apply our three-phase patching protocol to Janus-Pro, modifying both conditioned and unconditioned model parameters. Our comparative analysis evaluates against inference-time filters (Safe Latent Diffusion) and censored pretrained models (Stable Diffusion v2.0/2.1), demonstrating superior concept dissociation while maintaining anatomical coherence.
Our analysis reveals significant variations in erasure effectiveness across different concept categories when adjusting the autoregressive window size. Compared to baseline methods using fixed window lengths, our adaptive window approach better preserves model performance. We evaluate three distinct concept types: sensitive content (nudity), architectural elements (church), and artistic styles (Van Gogh).