Back to index page

Scanning real world objects without worries



One of the more challenging tasks in the automatic processing of data acquired from 3D scans is to do a coarse registration of different scans. In particular this means, that we want to find an rigid transformation that maps the points from one scan to the corresponding points of another. Once we have got this transformation further refinement can be done by using a fine registration like ICP.

We have implemented different ways of computing features within an image and we will give a short outline for each of them. A feature is needed to reduce the dense space of possible combinations for corresponding points.

The reference manual of the classes will give you an detailed description on how to use them. Here we only consider the theoretical background and the motivation to follow the implementation. For further reading we recommend the original papers as they will give a more elaborated description.

SIFT - Scale Invariant Feature Transform

A common approach to compute features in 2D images is SIFT, the Scale Invariant Feature Transform, that was first proposed by David Lowe (1999). The Algorithm extracts features that are invariant to rotation, scaling an partially invariant to changes in illumination an affine transformations. The features generated should be highly distinctive. In short the algorithm works as follows:

Space and Difference of Gaussian

The main idea behind SIFT is to select the features from distinct points in the difference of Gaussian. First off, the scale space is defined as the convolution of an image $I$ with a Gauss kernel $G$ with variance $\sigma$.

\[ L(x,y,\sigma) = G(x,y,\sigma) * I(x,y) \]

The difference of Gaussian is the difference between two layers in scale space along the $\sigma$ axis:

\[ D(x,y,\sigma) = (G(x,y, k\sigma) - G(x,y,\sigma)) * I(x,y) = L(x,y,k\sigma)-L(x,y,\sigma) \]



The difference of Gaussian provides a close approximation to the scale-normalize laplacian of Gaussian an hence a relationship to the heat diffusion equation

\[ \sigma\nabla^2G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x,y,k\sigma-G(x,y,\sigma}{k\sigma - \sigma} \]

and therefore

\[ G(x,y,k\sigma) - G(x,y,\sigma) \approx (k-1)\sigma^2\nabla^2 \]

Extrema in Scale Space

The localization of the keypoints is done by finding the extrema in the difference of Gaussian. First there is an threshold filter applied to the difference of Gaussian, to reduce complexity. After that global minima and maxima will be computed. In the next step the accuracy of the keypoints is increased by fitting a quadric to the local samples. The Taylor expansion can be used to find the extrema:

\[ D(x) = D+\frac{\partial D^T}{\partial x}x+\frac{1}{2}x^T\frac{\partial^2 D}{\partial x^2}x \Rightarrow \hat{x} = -\frac{\partial^2 D}{\partial x^2}^{-1}\frac{\partial D}{\partial x} \]

To reduce the number of features, first the points with low contrast are will be removed as they are more sensitive to noise. The paper from Lowe suggests to discard $\left| D(\hat{x} \right| < 0.03$, where $D(\hat{x})=D+\frac{1}{2}\frac{\partial D^T}{\partial x}\hat{x}$. Along that the features on edges are discarded as they are not well distinguishable. Following the approach by Harris & Stephens (1988), that edges have large principal curvature across the edge but small perpendicular to it, the ratio of principal curvature is computed in terms of eigenvectors by using the Gaussian and mean curvature using the Hessian matrix.




To find the orientation the gradients are pre computed and an orientation histogram of 36 bins weighted by the magnitude is used.



The features are derived from a 4x4 gradient window by using an histogram of 4x4 sampels per window in 8 direction The gradients are then Gaussian weighted around the center. This leads to an 128 dimensional feature vector.





The SIFT algorithm is only invariant to 3D rotation under orthographic projection. In our tests it was only stable for rotation around 30 degree. The library we used is called siftpp

Curvature based features

As we deal with depth images, the SIFT algorithm will fail, as it only works reliable on textured images and depth images are in general smooth. We can interpret the depth images (or point clouds) as points scattered on a surface and therefore consider properties from differential geometry to define features. We follow the approach of Taubin (1995) to compute the principal curvatures and derive the features from them. To calculate the curvatures we first need to approximate the normals.

Normal computation

We adopted two approaches to calculate the normals for the surface. As we deal with depth images the points lie on a regular grid and hence we can simply use the cross-product with respsect to adjacent points to calculate the normals. This is very fast but obviously sensitive to noise. We can also approximate the tangent space $T_i$ at a point $p_i$ by minimizing

\[ \varepsilon(n,r) = \sum_{x\in N_k(p_i)} ( <x-b,n> -r )^2 \]

where $n$ is the normal vector of the plane and $r$ the distance of the minimizing hyperplane through $b$. We can now consider this as an eigenvalue problem of the convariance matrix $M_i$ where $M_i \cdot n = \varepsilon(n,r) \cdot n$. The eigenvector with smallest eigenvalue is then our normal. Again, as we assume that we have depth images, we can fix the orientation of the normals, so they always point in direction of the camera.


The approach by Taubin was generalized by Lange and Polthier (2005) for point set surfaces and can be summarized as follows:

For a unit length vector $v$ in the tangent space in $p$ of a surface $S$ the directional curvature can be obtain as

\[ \kappa_p(v) = \lim_{s \to 0} \dfrac{2 \left< n, \gamma(s) - p \right> }{ \left\| \gamma(s) - p \right\| ^2 } \]

where $n$ is the surface normal and $\gamma$ a curve in the surface with $\gamma(0)=p$ and $\gamma'(0)=v$. We can now define a discrete directional curvature $\kappa_{ij}$ where

\[ \kappa_{ij} = \frac{ 2 \left< n_i, p_j - p_i \right> }{ \parallel p_j - p_i \parallel^2 } \]

The directional curvature $\kappa_p$ satisfies

\[ \kappa_p(e_\varphi) = e^\perp_\varphi \left( {\begin{array}{*{20}c} \kappa^{11}_p & \kappa^{12}_p \\ \kappa^{21}_p & \kappa^{22}_p \end{array} } \right) e_\varphi \]

Where $e_\varphi = \left( cos \varphi , sin \varphi \right)$ is relative to a basis $\{v_1, v_2\}$ of $T_pS$ with $\kappa^{11}_p = \kappa_p(v_1)$, $\kappa^{22}_p = \kappa_p(v_2)$ and $\kappa^{12}_p = \kappa^{21}_p$

We now assume that $v_1$ and $v_2$ are principal curvature directions and hence the Weingarten map $W_p$ is $\left( {\begin{array}{*{20}c} \kappa_p^1 & 0 \\ 0 & \kappa_p^2 \end{array} } \right)$. Such we can solve:

\[W_p = \frac{1}{2\pi} \int^{2\pi}_0 \mu_\varphi e_\varphi e^\perp_\varphi\]

where $\mu_\varphi = \mu_1 \cos^2 \varphi + \mu_2 \sin^2 \varphi$. From that we can derive

\[ W_i = \frac{1}{\pi} \int^{2\pi}_0 ( 2\kappa_\varphi - H_p ) e_\varphi e^{\perp}_\varphi \]

and after discretization one approaches

\[ W_i = \frac{1}{\pi} \sum_{ p_i \in N_i } \frac{ 4\pi \parallel p_i - p_j \parallel }{ | N_i | \delta_{ij} } ( 2\kappa_{ij} - H_i ) e^{tan}_{ij} e^{tan\perp}_{ij} \]

The eigenvalues and eigenvectors of the discrete shap operator $W_i$ are now the principal curvatures and principal curvature directions in $p_i$.


From image to curvature


The points with maximal mean curvature $H=\frac{1}{2}(\kappa_1 + \kappa_2)$ will be considered as features. We implemented an priority queue for that. As the highest values may be outliners an optional number of features can be skipped.


The main problem with this curvature based approach is, that it is sensitive to noise. We may note, that in our tests we only considered synthetic images without noise and we did no exact error analysis. Another problem is, that the number of features can be quite high, as there are may be large regions with high curvature and hence the features might be ambiguous. Finally the curvature based features will of course fail on surfaces with constant mean curvature like spheres.

Mesh Saliency

Lee, Varshney and Jacobs applied the concept of the difference of Gaussian on the curvatures of a surface. The saliency they define has different applications like simplification algorithms. But in essence delivers frequency ranges of the curvature. We adopt this and in a first step define

\[ G(v, \sigma) := \frac{\sum_{x \in N(v,2\sigma) }{ H_x e^{ - \parallel x - v \parallel^2 / 2(\sigma^2) } }}{ \sum_{x \in N(v,2\sigma) }{ e^{ - \parallel x - v \parallel^2 / 2(\sigma^2) } }} \]

as the Gaussian-weight average of the mean curvature $H$ at a point $v$ with variance $\sigma$. The saliency then is defined in terms of the difference of Gaussian:

\[ S_i(v) = | G( v, \sigma_i) - G( v, 2\sigma_i) | \]

Where $\sigma_i$ is the standard deviation of the Gaussian filter at scale $i$. The paper suggests a scales of $\{2\epsilon, 3\epsilon, 4\epsilon, 5\epsilon, 6\epsilon\}$ where $\epsilon$ is $0.3\%$ of the length of the diagonal of the bounding box.




Like for the curvature we choose our features to be the one with highest saliency.

Saliency features


The problems from the plain curvature approach now moves on to the choice of an reasonable value of $\sigma$. But the effect of noise can be reduced this way as we can choose a higher value of \đ$$ and thus smooth out noise. Still features for surfaces with constant mean curvature deliver are not well defined. Like the curvature the number of features has to be quite high to have an sufficient number of potential matches.

Feature Metric

Until now we only considered scalar (or possibly more dimensional features when considering the principal directions) values for each feature. The problem is, that those values are not very distinct and will not allow to do a precise matching. An other approach to determine a coarse rigid transformation was proposed by Pingi et. al as "`Exploiting the scanning sequence for automatic registration of large sets of range maps"'. They require to have consecutive range maps that overlap for at least 15 to 20%. The basic principles of the algorithm will be explained in short. We adapted the feature metric they proposed and use it do make our features more distinguishable.


Scanning sequence


The algorithm can be outlined in the following steps given two depth images $R_a$ and $R_b$:


Assume we have a depth (range) images $R_a$, then the features are described by a kernel $K_p$ around a point $p \in R_a$ where the entries of the matrix are $k_{i,j}^p = \left< n_p, n_t \right>$ and $n_p$ denotes the normal at the point $p$ and $t$ are the adjecent pixels around $p$.

From there we can compute the variance of each kernel as:

\[ s^2 \left( {K_p } \right) = \frac{1}{{n^2 }}\sum\limits_{i,j} {\left( {k_{i,j}^p - E\left( {K_p } \right)} \right)^2 } \]

where $E\left( {K_p } \right) = \sum\limits_{i,j} {\frac{{k_{i,j}^p }}{{n^2 }}}$. This also approximates the curvature of the surface. This way the algorithm works only in the image space, hence the gaps on boundaries (or other transistion with a disconutiy) are treated as connected and thus result in a high curvature. Thats why they only choose the points with medium variance ($\approx$medium curvature) as features.




The metric to measure distances between features is given by the frobenius norm of the difference of the kernels:

\[ d^2 \left( {K_p } \right) = \frac{1}{{n^2 }}\sum\limits_{i,j} {\left( {k_{i,j}^p - k_{i,j}^q } \right)^2 } \]

Then $t$ features with smallest distance to the other image to match are chosen. Call this set $Q$.


Generate for all $\binom{t}{3}$ combinations of the $t$ points a transformation $M_j$ and take the one with minimal

\[ \varepsilon _j = \frac{1}{3}\sum\limits_{\left( {p,q} \right) \in Q_j } {\left( {\left\| {p - M_j q} \right\|^2 } \right)} \]

That is we compute the transformations for all possible pairs of matches each time for 3 pairs and choose the one with the lowest error with respect to this 3 pairs. If $\varepsilon_j > coarse\_err$ restart and pick again some random features.

Feature Metric

We adopted the feature metric

\[ d^2 \left( {K_p } \right) = \frac{1}{{n^2 }}\sum\limits_{i,j} {\left( {k_{i,j}^p - k_{i,j}^q } \right)^2 } \]

for the other features we defined. That is we take the neighborhood of the features to make them more distinguishable.


There are some improvements that can be made and some ideas that where not implemented yet. One idea would be to follow more directly the concept of difference of Gaussian in the sense of SIFT when dealing with curvatures. This means that we do not look at a fixed variance to compute the saliency but to search along the different variance (scales) for extrema and define features this way. Moreover there might be better ways to describe the feature vector itself as the description by an neighborhood matrix is not invariant under rotation around the camera axis. We have to assume that the depth maps are taken in a reasonable sequence with the current descriptor. One other way to define the kernel might be to orient it by the principal frame of a given point.


G. Taubin, Estimating The Tensor Of Curvature Of A Surface From A Polyhedral Approximation, Fifth International Conference on Computer Vision (ICCV'95).

C. Lange and K. Polthier, Anisotropic Smoothing of Point Sets, Special Issue of CAGD 2005

Chang Ha Lee and Amitabh Varshney and David W. Jacobs, Mesh Saliency, ACM SIGGRAPH 2005

Paolo Pingi et al., Exploiting the scanning sequence for automatic registration of large sets of range maps, EUROGRAPHICS 2005

David G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, 2004