Abstract: Video-text cross-modal retrieval (VTR) is more natural and challenging than image-text retrieval, which has attracted increasing interest from researchers in recent years. To align VTR more ...