In this report we describe developments in the perception pipeline, which is crucial for safe robotic navigation. In Section 2, we briefly recap the previous pipeline, composed of multi-modal (RGB cameras and LiDAR sensors) person detection and joint tracking in a unified world coordinate system, and we present an overview of the updated components. These new components are discussed in detail in the remaining sections of the report. In Section 3, we present extensions to the LiDAR-based person detector, which include a self-supervised learning method for existing detectors using 2D LiDAR sensors [Jia21a] and a newly designed detector for 3D LiDAR sensors [Jia21b]. In Section 4, we discuss developments in flow estimation, including optical flow from images and 2D planar scene flow from LiDAR sensors. This low-level optical and scene flow information serves as a fall-back alternative should the high-level person detection and tracking become unreliable (e.g. in densely crowded scenarios). In Section 5, we present a new 3D human body pose estimation component [Sarandi21] for advanced person attribute analysis. Section 6 discusses improvements to the tracker, which address issues exposed by previous field testing in crowded market environments. The improved perception pipeline is showcased in Section 7, operating on data collected by EPFL [Paez-Granados21] and UCL during field testing.