3D point-cloud-based perception is a challenging but crucial computer vision task. A point cloud is a sparse, unstructured, and unordered set of points. To understand a point cloud, previous point-based methods, such as PointNet++, extract visual features through hierarchical aggregation of local features. However, such methods have several critical limitations: 1) They require several sampling and grouping operations, which slow down inference. 2) They spend an equal amount of computation on each point in a point cloud, even though many points have similar semantic meanings. 3) They aggregate local features through down-sampling, which leads to information loss and hurts perception performance. To overcome these limitations, we propose a simple and elegant deep learning model called YOGO (You Only Group Once). YOGO divides a point cloud into a small number of sub-regions and extracts a high-dimensional token to represent the points within each sub-region. Next, we use self-attention to capture token-to-token relations and project the token features back to the point features. We formulate this series of operations as a relation inference module (RIM). Compared with previous methods, YOGO needs to sample and group a point cloud only once, which makes it very efficient. Instead of operating on points, YOGO operates on a small, finite number of tokens, each of which summarizes the point features in a sub-region. This avoids computation on redundant points and thus boosts efficiency. Moreover, although the computation is performed on tokens, YOGO preserves point-wise features by projecting token features back to point features. This avoids information loss and can improve point-wise perception performance.
We conduct thorough experiments to demonstrate that YOGO achieves at least a 3.0x speedup over point-based baselines while delivering competitive segmentation performance on the ShapeNetPart and S3DIS datasets.
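
The group-once, tokenize, attend, and project-back pipeline described above can be illustrated with a minimal pure-Python sketch. It is not the YOGO implementation: the function names (`group_once`, `relation_inference`, `attend`), the random choice of region centers, the nearest-center assignment, the max-pooling used to form tokens, the attention without learned weights, and the residual connection are all assumptions made only for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention with no learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def group_once(coords, num_tokens, rng):
    """Sample and group a single time: pick centers at random (assumption)
    and assign every point to its nearest center."""
    centers = rng.sample(coords, num_tokens)
    return [min(range(num_tokens), key=lambda j: math.dist(p, centers[j]))
            for p in coords]


def relation_inference(features, assignment, num_tokens):
    """Hypothetical RIM-like step: tokens -> self-attention -> back to points."""
    dim = len(features[0])
    # 1) One token per sub-region, here the max-pool of its point features.
    groups = [[] for _ in range(num_tokens)]
    for f, g in zip(features, assignment):
        groups[g].append(f)
    tokens = [[max(col) for col in zip(*grp)] if grp else [0.0] * dim
              for grp in groups]
    # 2) Token-to-token relations via self-attention.
    tokens = attend(tokens, tokens, tokens)
    # 3) Project token features back to every point (points act as queries);
    #    the residual sum keeps the original point-wise features.
    projected = attend(features, tokens, tokens)
    return [[x + y for x, y in zip(f, p)] for f, p in zip(features, projected)]


if __name__ == "__main__":
    rng = random.Random(0)
    coords = [(rng.random(), rng.random(), rng.random()) for _ in range(32)]
    feats = [[rng.random() for _ in range(4)] for _ in range(32)]
    assignment = group_once(coords, num_tokens=4, rng=rng)
    out = relation_inference(feats, assignment, num_tokens=4)
    print(len(out), len(out[0]))
```

In this sketch the two attention steps cost on the order of the number of tokens squared and the number of points times the number of tokens, respectively, so no step compares every point with every other point. The output has one feature vector per input point, which reflects the idea that computation happens on tokens while point-wise features are retained.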



