Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations
Published as an arXiv preprint, 2026
We investigate zero-shot cross-city generalization in end-to-end autonomous driving, focusing on the role of visual representation learning. By enforcing strict geographic splits across cities in nuScenes and NAVSIM, we isolate the effect of backbone pretraining while keeping the planning architecture fixed. Our results show that supervised ImageNet-pretrained models suffer significant performance degradation when transferred across cities, particularly under shifts in driving conventions. In contrast, self-supervised representations such as I-JEPA, DINOv2, and MAE consistently improve cross-city robustness, highlighting representation learning as a key factor for generalization in autonomous driving.
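As a rough illustration of what a strict geographic split involves, the Python sketch below partitions scenes by city so that no city contributes to both training and evaluation. It is a minimal sketch under stated assumptions, not the paper's actual protocol: the scene records, the `city_of` and `geographic_split` helpers, and the Boston-to-Singapore direction are hypothetical. The only detail taken from nuScenes is that each log carries a location string such as `boston-seaport` or `singapore-onenorth`.

```python
from collections import defaultdict

# Hypothetical scene records for illustration only. In nuScenes, each log
# has a `location` string; these scene names are made up.
SCENES = [
    {"scene": "scene-0001", "location": "boston-seaport"},
    {"scene": "scene-0002", "location": "singapore-onenorth"},
    {"scene": "scene-0003", "location": "boston-seaport"},
    {"scene": "scene-0004", "location": "singapore-queenstown"},
]


def city_of(location: str) -> str:
    """Map a location string such as 'singapore-onenorth' to its city."""
    return location.split("-", 1)[0]


def geographic_split(scenes, train_cities, test_cities):
    """Split scenes by city, refusing any city that appears on both sides."""
    train_cities, test_cities = set(train_cities), set(test_cities)
    overlap = train_cities & test_cities
    if overlap:
        raise ValueError(f"cities on both sides of the split: {sorted(overlap)}")

    split = defaultdict(list)
    for record in scenes:
        city = city_of(record["location"])
        if city in train_cities:
            split["train"].append(record["scene"])
        elif city in test_cities:
            split["test"].append(record["scene"])
        # Scenes from cities in neither set are dropped.
    return dict(split)


# Assumed direction: train on Boston (right-hand traffic) and evaluate
# zero-shot on Singapore (left-hand traffic).
split = geographic_split(SCENES, train_cities=["boston"], test_cities=["singapore"])
print(split)
```

Splitting at the city level rather than at the scene level is what makes the evaluation zero-shot with respect to geography: a random scene-level split would let the same streets appear in both training and test data.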
Recommended citation: F. Naeinian, A. Hamza, H. Zhu, A. Choromanska, "Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations," arXiv:2603.11417, 2026.
