Abstract: Advancements in Zero-shot Multimodal Egocentric Activity Recognition (ZS-MM-EAR) largely rely on Vision-Language Models (VLMs). However, existing methods struggle with VLMs' inadequate ...