Research
My recent research focuses on real-time multimodal models, integrating streaming video and audio with LLMs. Previously, I worked on large-scale pretraining, knowledge distillation, and instruction-following large multimodal models.
|
|
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li
arXiv, 2024
arxiv / code / data
We introduce MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs).
|
|
Instruction-Guided Visual Masking
Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan
arXiv, 2024
arxiv / code
IVM is a versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robotic models.
|
|
GLID: Pre-training a Generalist Encoder-Decoder Vision Model
Jihao Liu, Jinliang Zheng, Yu Liu, Hongsheng Li
CVPR, 2024
arxiv
We propose a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling of various downstream computer vision tasks.
|
|
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li
TMLR, 2024
paper / code
We introduce Unmasked Token Alignment (UTA) for efficient vision-language representation learning.
|
|
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
Jihao Liu, Tai Wang, Boxiao Liu, Qihang Zhang, Yu Liu, Hongsheng Li
ICCV, 2023
arxiv / code
We propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer knowledge from a LiDAR model in a pretrain-finetune paradigm, improving multi-view camera-based 3D detection.
|
|
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li
CVPR, 2023
arxiv / code
We propose MixMAE for efficient pretraining of hierarchical vision transformers.
|
|
TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers
Jihao Liu, Boxiao Liu, Hang Zhou, Hongsheng Li, Yu Liu
ECCV, 2022
arxiv / code
A token-level augmentation technique that applies well to training various transformer-based architectures.
|
|
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP
Jihao Liu, Xin Huang, Guanglu Song, Hongsheng Li, Yu Liu
ECCV, 2022
arxiv / code
High-performance hybrid visual architectures discovered through unified architecture search.
|
|
Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images
Hang Zhou*, Jihao Liu*, Ziwei Liu, Yu Liu, Xiaogang Wang
CVPR, 2020
arxiv / code
A self-supervised approach for face rotation in the wild.
|
|
Learning Where to Focus for Efficient Video Object Detection
Zhengkai Jiang, Yu Liu, Ceyuan Yang, Jihao Liu, Peng Gao, Qian Zhang, Shiming Xiang, Chunhong Pan
ECCV, 2020
arxiv / code
We propose the LSTS module to accurately learn semantic-level correspondences among adjacent frame features.
|
|
Differentiable Kernel Evolution
Yu Liu*, Jihao Liu*, Ailing Zeng, Xiaogang Wang
ICCV, 2019
pdf
We propose a differentiable kernel evolution (DKE) algorithm that finds better layer operators for convolutional neural networks.
|
|
Meta Knowledge Distillation
Jihao Liu, Jinliang Zheng, Boxiao Liu, Hongsheng Li, Yu Liu
arXiv, 2022
arxiv
We propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters.
|
|
FNAS: Uncertainty-Aware Fast Neural Architecture Search
Jihao Liu, Ming Zhang, Yangting Sun, Boxiao Liu, Guanglu Song, Yu Liu, Hongsheng Li
arXiv, 2021
arxiv
We propose a general pipeline to accelerate the convergence of the rollout process as well as the RL process in NAS.
|
Professional Activities
Conference Reviewer for CVPR, ECCV, ICCV, NeurIPS, ICML
|
Selected Honors & Awards
Postgraduate Scholarship, the Chinese University of Hong Kong, 2022-2026
Championship, The Lightweight Face Recognition Challenge & Workshop, ICCV 2019
Gold medal, TensorFlow Speech Recognition Challenge at Kaggle
Rank 1, Face Recognition Vendor Test 1:N Identification, NIST, 2020
|
Teaching
Machine Learning for Multimedia Applications (ELEG5760), Fall 2022
|
The website template was borrowed from Jon Barron.