Zhen Zhu

Envision the whole of you.

CS Ph.D. candidate at the University of Illinois at Urbana-Champaign (UIUC).

Picture taken in Hawaii in 2024.

Welcome! I'm a final-year Ph.D. candidate at the University of Illinois at Urbana-Champaign (UIUC), working with Professor Derek Hoiem. I received my Master's degree from HUST in June 2020, where I was supervised by Professor Xiang Bai.

My goal is to develop trustworthy learning systems that adapt continuously, reason across modalities, and remain under human control.

Research Focus

  • Continual & Dynamic Learning — algorithms that update models in real time without forgetting.
  • Multimodal Models — factual and grounded large multimodal models that integrate images, text, video, etc.
  • Controllable Image Synthesis — autoregressive/diffusion-based models for fast, precise and user-directed editing.

Recent Highlights

  • Forgetting in large multimodal models. New work shows that tuning a small set of layers in an LMM adds new skills with almost no forgetting and without storing any past data.
  • Anytime Continual Learning lets open-vocabulary classifiers absorb new examples in milliseconds while retaining their zero-shot capabilities.
  • Instant/geometry-aware diffusion editing enables few-step, training-free image manipulation on edge devices.

I am currently on the job market for tenure-track faculty, postdoctoral, and research scientist positions beginning around early 2026. Feel free to reach out if our interests align.

More about me: CV · Google Scholar

News

Mar 18, 2025 I visited Prof. Carl Vondrick's lab at Columbia University and gave a talk about flexible and dynamic learning.
Feb 19, 2025 I gave an invited talk at AI2, hosted by Prof. Ranjay Krishna, titled "Towards Flexible Continual Learning and Beyond".
Dec 07, 2024 I spent a wonderful week at UC Berkeley, visiting Prof. Alyosha Efros's lab and giving a talk about flexible and dynamic learning.
Sep 20, 2024 Our TreeProbe paper was accepted to TMLR.
Aug 12, 2024 Our AnytimeCL paper was accepted to ECCV 2024 as an oral presentation.
May 21, 2024 I started a research internship at Google, working with Daniel ReMine and Catherine Zhang.
Aug 01, 2023 One paper on multi-modal generation was accepted to WACV 2024 in the first round.
May 15, 2023 Started another internship at Adobe with the same group.
May 15, 2022 Started an internship at Adobe, working with Yijun, Krishna, and Zhixin.
May 01, 2022 One paper was accepted to ECCV 2022 as an oral presentation.
Aug 11, 2021 Arrived at UIUC.
Mar 01, 2021 The extended version of our CVPR 2019 pose transfer paper was accepted to TPAMI.
Jan 01, 2021 Enrolled at UIUC and began my Ph.D. studies.
Dec 01, 2020 Journey at ShanghaiTech and Next Steps
Aug 01, 2020 Research Assistant Position at ShanghaiTech

2018

  1. Detection/Segmentation
    2018 CVPR
    DOTA: A Large-scale Dataset for Object Detection in Aerial Images
    Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang

    Abstract: Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to reach aerial imagery, not only because of the huge variation in the scale, orientation and shape of the object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect 2806 aerial images from different sensors and platforms. Each image is about 4000-by-4000 pixels and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using 15 common object categories. The fully annotated DOTA images contain 188,282 instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and is quite challenging.

  2. Image Generation
    2018 TOG
    Non-stationary Texture Synthesis by Adversarial Expansion
    Yang Zhou*, Zhen Zhu*, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang
    *Joint first author

    Abstract: The real world exhibits an abundance of non-stationary textures. Examples include textures with large-scale structures, as well as spatially variant and inhomogeneous textures. While existing example-based texture synthesis methods can cope well with stationary textures, non-stationary textures still pose a considerable challenge, which remains unresolved. In this paper, we propose a new approach for example-based non-stationary texture synthesis. Our approach uses a generative adversarial network (GAN), trained to double the spatial extent of texture blocks extracted from a specific texture exemplar. Once trained, the fully convolutional generator is able to expand the size of the entire exemplar, as well as of any of its sub-blocks. We demonstrate that this conceptually simple approach is highly effective for capturing large-scale structures, as well as other non-stationary attributes of the input exemplar. As a result, it can cope with challenging textures, which, to our knowledge, no other existing method can handle.

  3. Abstract: Text in natural images is of arbitrary orientations, requiring detection in terms of oriented bounding boxes. Normally, a multi-oriented text detector often involves two key tasks: 1) text presence detection, which is a classification problem disregarding text orientation; 2) oriented bounding box regression, which concerns text orientation. Previous methods rely on shared features for both tasks, resulting in degraded performance due to the incompatibility of the two tasks. To address this issue, we propose to perform classification and regression on features of different characteristics, extracted by two network branches of different designs. Concretely, the regression branch extracts rotation-sensitive features by actively rotating the convolutional filters, while the classification branch extracts rotation-invariant features by pooling the rotation-sensitive features. The proposed method named Rotation-sensitive Regression Detector (RRD) achieves state-of-the-art performance on oriented scene text benchmark datasets, including ICDAR 2015, MSRA-TD500, RCTW-17 and COCO-Text. Furthermore, RRD achieves a significant improvement on a ship collection dataset, demonstrating its generality on oriented object detection.

2019

  1. Abstract: The non-local module is a particularly useful technique for semantic segmentation, but it is often criticized for its prohibitive computation and GPU memory occupation. In this paper, we present an Asymmetric Non-local Neural Network for semantic segmentation, which has two prominent components: Asymmetric Pyramid Non-local Block (APNB) and Asymmetric Fusion Non-local Block (AFNB). APNB incorporates a pyramid sampling module into the non-local block to largely reduce computation and memory consumption without sacrificing performance. AFNB is adapted from APNB to fuse the features of different levels while fully accounting for long-range dependencies, and thus considerably improves performance. Extensive experiments on semantic segmentation benchmarks demonstrate the effectiveness and efficiency of our work. In particular, we report state-of-the-art performance of 81.3 mIoU on the Cityscapes test set. For a 256x128 input, APNB is around 6 times faster than a non-local block on GPU while using 28 times less GPU memory.
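
    To make the asymmetric idea concrete, here is a minimal PyTorch sketch (my own simplification, not our released code): keys and values are subsampled with a small pooling pyramid so attention has shape N x S with S much smaller than N, instead of N x N. The pool sizes and inner channel width below are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidNonLocal(nn.Module):
        """Non-local block whose keys/values are pooled to a few hundred positions."""
        def __init__(self, channels, inner=64, pool_sizes=(1, 3, 6, 8)):
            super().__init__()
            self.query = nn.Conv2d(channels, inner, 1)
            self.key = nn.Conv2d(channels, inner, 1)
            self.value = nn.Conv2d(channels, inner, 1)
            self.out = nn.Conv2d(inner, channels, 1)
            self.pool_sizes = pool_sizes

        def pyramid_sample(self, x):
            # Pool at several scales and flatten: (B, C', S) with S = sum(p * p) = 110 here.
            pooled = [F.adaptive_avg_pool2d(x, p).flatten(2) for p in self.pool_sizes]
            return torch.cat(pooled, dim=2)

        def forward(self, x):
            b, c, h, w = x.shape
            q = self.query(x).flatten(2).transpose(1, 2)             # (B, N, C'), N = H*W
            k = self.pyramid_sample(self.key(x))                     # (B, C', S)
            v = self.pyramid_sample(self.value(x)).transpose(1, 2)   # (B, S, C')
            attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, N, S) instead of (B, N, N)
            y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
            return x + self.out(y)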

  2. Image Generation

    Abstract: This paper proposes a new generative adversarial network for pose transfer, i.e., transferring the pose of a given person to a target pose. The generator of the network comprises a sequence of Pose-Attentional Transfer Blocks that each transfers certain regions it attends to, generating the person image progressively. Compared with those in previous works, our generated person images possess better appearance consistency and shape consistency with the input images, thus significantly more realistic-looking. The efficacy and efficiency of the proposed network are validated both qualitatively and quantitatively on Market-1501 and DeepFashion. Furthermore, the proposed architecture can generate training images for person re-identification, alleviating data insufficiency.

2020

  1. Detection/Segmentation
    2020 ECCV 🏆 Oral
    Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, and Yunhai Tong

    Abstract: In this paper, we focus on designing an effective method for fast and accurate scene parsing. A common practice to improve performance is to attain high-resolution feature maps with strong semantic representation. Two widely used strategies, atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels and to broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure yields superior performance over other real-time methods, even with light-weight backbone networks such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. Notably, our network is the first to achieve 80.4% mIoU on Cityscapes at a frame rate of 26 FPS.
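
    Below is a hedged sketch of a FAM-style layer, with my own module name (FlowAlign), channel choices, and warping details rather than the paper's exact design: a small convolution predicts a 2D "semantic flow" field from the concatenated fine and upsampled coarse features, and the coarse features are warped onto the fine-resolution grid with that flow before fusion.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FlowAlign(nn.Module):
        """Warp coarse, semantically strong features onto the fine-resolution grid."""
        def __init__(self, channels):
            super().__init__()
            self.flow = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

        def forward(self, fine, coarse):
            b, c, h, w = fine.shape
            coarse_up = F.interpolate(coarse, size=(h, w), mode='bilinear', align_corners=True)
            flow = self.flow(torch.cat([fine, coarse_up], dim=1))        # (B, 2, H, W) pixel offsets
            ys, xs = torch.meshgrid(torch.arange(h, device=fine.device),
                                    torch.arange(w, device=fine.device), indexing='ij')
            grid = torch.stack((xs, ys), dim=-1).float()                  # (H, W, 2), xy order
            grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)           # (B, H, W, 2)
            grid = torch.stack((2.0 * grid[..., 0] / max(w - 1, 1) - 1.0, # normalize to [-1, 1]
                                2.0 * grid[..., 1] / max(h - 1, 1) - 1.0), dim=-1)
            warped = F.grid_sample(coarse_up, grid, align_corners=True)
            return fine + warped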

  2. Abstract: In this paper, we focus on the semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. Previous work uses multiple class-specific generators, constraining its usage to datasets with a small number of classes. We instead propose a novel Group Decreasing Network (GroupDNet) that leverages group convolutions in the generator and progressively decreases the group counts of the convolutions in the decoder. Consequently, GroupDNet has much more controllability in translating semantic labels to natural images and produces plausible, high-quality results on datasets with many classes. Experiments on several challenging datasets demonstrate the superiority of GroupDNet on the SMIS task. We also show that GroupDNet is capable of performing a wide range of interesting synthesis applications.
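
    The decoder idea can be illustrated with a short sketch (my own toy module, not the paper's code): group convolutions whose group count decreases toward the output, so early decoder layers keep per-class channel groups separate while later layers progressively mix them. The channel and group counts below are arbitrary assumptions.

    import torch
    import torch.nn as nn

    def group_decoder(channels=(512, 256, 128, 64, 3), groups=(8, 4, 2, 1)):
        # Group counts shrink toward the output: per-class channel groups are mixed late.
        layers = []
        for c_in, c_out, g in zip(channels[:-1], channels[1:], groups):
            layers += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, groups=g),
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*layers[:-1])  # no ReLU on the final RGB layer

    latent = torch.randn(1, 512, 8, 8)      # class-grouped latent features
    image = group_decoder()(latent)         # (1, 3, 128, 128)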

2021

  1. Abstract: This paper proposes a new generative adversarial network for pose transfer, i.e., transferring the pose of a given person to a target pose. We design a progressive generator which comprises a sequence of transfer blocks. Each block performs an intermediate transfer step by modeling the relationship between the condition and the target poses with an attention mechanism. Two types of blocks are introduced, namely Pose-Attentional Transfer Block (PATB) and Aligned Pose-Attentional Transfer Block (APATB). Compared with previous works, our model generates more photorealistic person images that better preserve the appearance and shape consistency of the input images. We verify the efficacy of the model on the Market-1501 and DeepFashion datasets, using quantitative and qualitative measures. Furthermore, we show that our method can be used for data augmentation for the person re-identification task, alleviating the issue of data insufficiency.

2022

  1. Image Generation

    Abstract: In this paper, we aim to devise a universally versatile style transfer method capable of performing artistic, photo-realistic, and video style transfer jointly, without seeing videos during training. Previous single-frame methods assume a strong constraint on the whole image to maintain temporal consistency, which could be violated in many cases. Instead, we make a mild and reasonable assumption that global inconsistency is dominated by local inconsistencies and devise a generic Contrastive Coherence Preserving Loss (CCPL) applied to local patches. CCPL can preserve the coherence of the content source during style transfer without degrading stylization. Moreover, it owns a neighbor-regulating mechanism, resulting in a vast reduction of local distortions and considerable visual quality improvement. Aside from its superior performance on versatile style transfer, it can be easily extended to other tasks, such as image-to-image translation. Besides, to better fuse content and style features, we propose Simple Covariance Transformation (SCT) to effectively align second-order statistics of the content feature with the style feature. Experiments demonstrate the effectiveness of the resulting model for versatile style transfer, when armed with CCPL.
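
    A minimal sketch of the coherence idea, under my own naming and simplifications (a single feature level, with the "neighbor" taken as the next spatial position): difference vectors between neighboring features of the stylized output are pulled toward the corresponding content differences with an InfoNCE-style loss, which is the patch-level contrast described above in spirit, not the released CCPL implementation.

    import torch
    import torch.nn.functional as F

    def coherence_loss(content_feat, output_feat, num_pairs=64, tau=0.07):
        # content_feat / output_feat: (B, C, H, W) encoder features of the content image
        # and of the stylized output, taken from the same layer.
        b, c, h, w = content_feat.shape
        flat_c = content_feat.flatten(2)                      # (B, C, H*W)
        flat_o = output_feat.flatten(2)
        # Random anchor positions; the "neighbor" is simply the next position here.
        idx = torch.randint(0, h * w - 1, (num_pairs,), device=content_feat.device)
        d_c = F.normalize(flat_c[:, :, idx + 1] - flat_c[:, :, idx], dim=1)  # content differences
        d_o = F.normalize(flat_o[:, :, idx + 1] - flat_o[:, :, idx], dim=1)  # output differences
        # InfoNCE: each output difference should match its own content difference,
        # not the differences measured at other locations.
        logits = torch.einsum('bcp,bcq->bpq', d_o, d_c) / tau                # (B, P, P)
        labels = torch.arange(num_pairs, device=logits.device).expand(b, -1)
        return F.cross_entropy(logits.reshape(-1, num_pairs), labels.reshape(-1))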

2024

  1. Multimodal Learning · Continual/Dynamic Learning

    Abstract: We propose an approach for anytime continual learning (AnytimeCL) for open vocabulary image classification. The AnytimeCL problem aims to break away from batch training and rigid models by requiring that a system can predict any set of labels at any time and efficiently update and improve when receiving one or more training samples at any time. Despite the challenging goal, we achieve substantial improvements over recent methods. We propose a dynamic weighting between predictions of a partially fine-tuned model and a fixed open vocabulary model that enables continual improvement when training samples are available for a subset of a task’s labels. We also propose an attention-weighted PCA compression of training features that reduces storage and computation with little impact to model accuracy. Our methods are validated with experiments that test flexibility of learning and inference.
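
    Two of the ingredients above can be sketched compactly. The function names, the exact form of the per-label weight, and the use of torch.pca_lowrank are my assumptions for illustration, not the released implementation.

    import torch

    def blended_prediction(p_tuned, p_open_vocab, seen, alpha=0.7):
        # p_tuned / p_open_vocab: (L,) probabilities over the current label set;
        # seen: (L,) bool mask of labels the tuned model has training samples for.
        w = alpha * seen.float()                           # trust the tuned model only on seen labels
        p = w * p_tuned + (1.0 - w) * p_open_vocab
        return p / p.sum()

    def compress_features(features, weights, k=64):
        # Weighted low-rank compression of stored training features (N, D): rows with
        # larger attention weights contribute more to the retained subspace.
        q = min(k, features.shape[0], features.shape[1])
        weighted = features * weights.unsqueeze(1).sqrt()
        _, _, basis = torch.pca_lowrank(weighted, q=q)     # basis: (D, q) principal directions
        return features @ basis, basis                     # compressed codes and the basis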

  2. Multimodal Learning · Continual/Dynamic Learning

    Abstract: We introduce a method for flexible and efficient continual learning in open-vocabulary image classification, drawing inspiration from the complementary learning systems observed in human cognition. Specifically, we propose to combine predictions from a CLIP zero-shot model and an exemplar-based model, using the zero-shot estimated probability that a sample's class is within the exemplar classes. We also propose a "tree probe" method, an adaptation of lazy learning principles, which enables fast learning from new examples with accuracy competitive with batch-trained linear models. We test in data incremental, class incremental, and task incremental settings, as well as the ability to perform flexible inference on varying subsets of zero-shot and learned categories. Our proposed method achieves a good balance of learning speed, target task effectiveness, and zero-shot effectiveness.
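
    The combination rule can be written in a few lines. This is a hedged sketch under my own naming, not the paper's code: the CLIP zero-shot estimate that the sample's class lies within the exemplar classes decides how much weight the exemplar-based prediction gets.

    def combine(p_exemplar, p_zero_shot, exemplar_classes):
        # p_exemplar: (L,) torch probabilities from the exemplar-based model (zero outside its classes).
        # p_zero_shot: (L,) probabilities from the frozen CLIP zero-shot model over all L labels.
        # exemplar_classes: (L,) bool mask of labels covered by stored exemplars.
        p_in = p_zero_shot[exemplar_classes].sum()   # zero-shot estimate that the true class is "seen"
        combined = p_in * p_exemplar + (1.0 - p_in) * p_zero_shot
        return combined / combined.sum()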

2025

  1. 2025 Under Review
    How To Teach Large Multimodal Models New Tricks?
    Under Review
    Multimodal Learning · Continual/Dynamic Learning

    Abstract: Large multimodal models (LMMs) are effective for many vision and language problems but may underperform in specialized domains such as object counting and clock reading. Fine-tuning improves target task performance but sacrifices generality, while retraining with an expanded dataset is expensive. We investigate how to teach LMMs new skills and domains, examining the effects of tuning different components and of multiple strategies to mitigate forgetting. We experiment by tuning on new target tasks singly or sequentially and measuring learning as target task performance and forgetting as held-out task performance. Surprisingly, we find that the self-attention projection layers in the language model of the tested LMM can be fine-tuned to learn without forgetting. Fine-tuning the MLP layers in the language model improves learning with much less forgetting than tuning the full model, and employing knowledge distillation regularization mitigates forgetting greatly. We will release code to foster reproducible research on continual adaptation of large multimodal models.
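
    As a rough illustration of the recipe described above (not our released implementation), the sketch below freezes everything except the self-attention projection layers and adds a knowledge-distillation regularizer toward the frozen original model. The module-name patterns ("q_proj", "k_proj", ...) assume a Hugging Face-style LLaVA-like model and will differ across architectures.

    import torch.nn.functional as F

    def freeze_all_but_attention_projections(model):
        # Tune only the self-attention projection matrices of the language model.
        for name, param in model.named_parameters():
            param.requires_grad = any(key in name for key in ("q_proj", "k_proj", "v_proj", "o_proj"))

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Knowledge-distillation regularizer toward the frozen original model to limit forgetting.
        return F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                        F.softmax(teacher_logits / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2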

Collaborations

I've had the privilege of mentoring several talented students throughout my Ph.D. journey:

  • Zhiliang Xu — Image Generation and Face Synthesis
  • Yang Liu — Watermark Removal and Image Processing
  • Zijie Wu — Style Transfer and Generative Models
  • Yiming Gong — Machine Learning and Image Editing
  • Joshua Cho — Computational Photography and Image Enhancement
  • Xudong Xie — Texture Synthesis

I'm currently collaborating with the following researchers:

  • Yao Xiao — Video Understanding and Multimodal Learning
  • Zhipeng Bao — Multimodal Generation

Service

Co-organizer: UIUC External Speaker Series — Interested speakers are welcome to reach out to register for upcoming sessions

Co-organizer: UIUC Vision Mini-Conference

Conference Reviewer: CVPR, ICCV, ECCV, ICLR, NeurIPS, ICML, AAAI, IJCAI, BMVC, WACV, and others

Journal Reviewer: TPAMI, IJCV, TIP, PR, and others