LCD: Learned Cross-Domain Descriptors for 2D-3D Matching

Quang-Hieu Pham1     Mikaela Angelina Uy2     Binh-Son Hua3     Duc Thanh Nguyen4    
Gemma Roig5     Sai-Kit Yeung6    

1Singapore University of Technology and Design     2Stanford University     3The University of Tokyo    
4Deakin University     5Goethe University Frankfurt am Main     6Hong Kong University of Science and Technology

AAAI Conference on Artificial Intelligence, 2020 (Oral).

Our proposed network consists of a 2D auto-encoder and a 3D auto-encoder. The input image and point cloud are reconstructed with a photometric loss and a Chamfer loss, respectively. The reconstruction losses ensure that features in the embedding are discriminative and representative. The similarity between the 2D and 3D embeddings is further regularized by a triplet loss. Diagram notation: fc for fully-connected layers; conv/deconv(kernel_size, out_dim, stride, padding) for convolution and deconvolution, respectively. By default, each convolution and deconvolution is followed by a ReLU activation and batch normalization.
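The three losses above can be sketched as follows — a minimal numpy sketch for illustration only, not the released training code; function names, the MSE form of the photometric loss, and the margin value are assumptions:

```python
import numpy as np

def photometric_loss(img, recon):
    """Mean squared error between an input image patch and its reconstruction
    (assumed MSE form of the photometric loss)."""
    return np.mean((img - recon) ** 2)

def chamfer_loss(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3):
    average nearest-neighbor distance in both directions."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pulling a matching 2D/3D embedding pair together and
    pushing a non-matching pair at least `margin` apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)
```

The total training objective would then be a (weighted) sum of the two reconstruction losses and the triplet term.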


In this work, we present a novel method to learn a local cross-domain descriptor for 2D image and 3D point cloud matching. Our proposed method is a dual auto-encoder neural network that maps 2D and 3D inputs into a shared latent space representation. We show that such local cross-domain descriptors in the shared embedding are more discriminative than those obtained from individual training in the 2D and 3D domains. To facilitate the training process, we built a new dataset of ≈1.4 million 2D-3D correspondences captured under various lighting conditions and settings from publicly available RGB-D scenes. Our descriptor is evaluated in three main experiments: 2D-3D matching, cross-domain retrieval, and sparse-to-dense depth estimation. Experimental results confirm the robustness of our approach as well as its competitive performance, not only in solving cross-domain tasks but also in generalizing to single-domain 2D and 3D tasks.
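Because both domains map into one latent space, 2D-3D matching reduces to nearest-neighbor search between descriptor sets. A minimal sketch of this step, assuming L2-normalized descriptors; the function name and shapes are illustrative, not from the released code:

```python
import numpy as np

def match_descriptors(desc_2d, desc_3d):
    """Nearest-neighbor matching of descriptors across domains.

    desc_2d: (N, D) image-patch embeddings; desc_3d: (M, D) point-cloud
    embeddings. Returns, for each 2D descriptor, the index of its closest
    3D descriptor in Euclidean distance after L2 normalization."""
    a = desc_2d / np.linalg.norm(desc_2d, axis=1, keepdims=True)
    b = desc_3d / np.linalg.norm(desc_3d, axis=1, keepdims=True)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return dist.argmin(axis=1)
```

The same search, run over submap descriptors instead of local patches, underlies the 2D-3D place-recognition retrieval experiment.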



Qualitative image matching comparison between SIFT and our proposed descriptor. Our descriptor can correctly identify features from the wall and the refrigerator, while SIFT fails to differentiate them.
Qualitative geometric registration results on the 3DMatch benchmark. Our method successfully aligns pairs of fragments in different challenging scenarios by matching local 3D descriptors, while 3DMatch fails in cases where the geometry is ambiguous.
Top-3 retrieval results of the 2D-3D place recognition task using our descriptor. The task is to find the corresponding 3D geometry submap(s) in the database given a query 2D image. Green/red borders mark correct/incorrect retrievals. It can be seen that our learned cross-domain descriptor is highly effective in this 2D-to-3D retrieval task.
Sparse-to-dense depth estimation results. Inputs are an RGB image and 2048 sparse depth samples. Our network estimates a dense depth map by reconstructing local 3D points.



@inproceedings{pham2020lcd,
  title = {{LCD}: {L}earned cross-domain descriptors for 2{D}-3{D} matching},
  author = {Pham, Quang-Hieu and Uy, Mikaela Angelina and Hua, Binh-Son and Nguyen, Duc Thanh and Roig, Gemma and Yeung, Sai-Kit},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year = {2020}
}


This research project is partially supported by an internal grant from HKUST (R9429).