Abstract: With the development of multimodal learning technology, the demand for combining text information to achieve more accurate object detection tasks is growing. Traditional object detection ...