Abstract
Modern deep neural generative and reconstruction pipelines excel at visual realism yet rarely expose the underlying structure that users often wish to intuitively control in 2D images or 3D object representations, such as part layout, shape topology, shape style, or feature curves. Voxel grids, point sets, and dense latent tensors entangle coarse arrangement with fine detail, limiting user control to latent space interpolation or prompt engineering. Motivated by the broader agenda of structure-aware shape and image synthesis, this thesis embeds geometry-aligned structural abstractions (part assemblies, medial skeletons, and learned structure-aware shape embeddings) inside neural implicit functions and text-to-image diffusion models to improve the geometric fidelity of generated results while granting intuitive control over the generation process.
Our first contribution is a coarse-to-fine 3D reconstruction pipeline, named ANISE, which maps a partial observation (either a single-view image or a point cloud scan) to a full 3D shape. The method decouples structure, expressed as part locations and sizes, from geometry, represented by local signed-distance fields. This separation enables post-hoc editing, such as part swapping, interpolation, and constrained assembly, all within a single forward pass. ANISE can be fine-tuned with whole-shape supervision and supports two inference modes: direct implicit decoding for watertight surfaces, or latent-based retrieval, which replaces each implicit part with the nearest mesh part from a database.
Our second contribution is GEM3D, a generative 3D shape pipeline that first produces a point-sampled medial skeleton capturing the shape’s topology, and then maps this skeleton to a neural implicit representation of the surface. The pipeline comprises two diffusion processes. In the first stage, Gaussian noise is denoised into a category-conditioned skeleton; in the second, a latent vector is predicted for every skeletal point. The decoder blends locally supported implicit patches, linking surface thickness to skeletal position and thereby preserving holes, handles, and other high-genus features. A fast, parameter-free skeletonizer provides training pairs for large datasets. The decoder can also operate on generated or hand-drawn skeletons, enabling stochastic synthesis and topology-aware reconstruction, and paving the way for skeleton-based interactive design.
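As a rough illustration of how tying surface thickness to skeletal position can preserve holes, the sketch below evaluates a classical medial-ball implicit: the shape is the union of balls centred on skeletal points, each with its own radius. This is a deliberately simplified stand-in and not the learned GEM3D decoder; the function name `medial_sdf`, the ring skeleton, and the radius value are all hypothetical choices made for the example.

```python
import math

def medial_sdf(query, skeleton, radii):
    """Distance-like implicit value for a union of balls on skeletal points.

    Negative inside the union, positive outside. The sign is exact; the
    magnitude is only an approximation of the true signed distance.
    """
    return min(math.dist(query, s) - r for s, r in zip(skeleton, radii))

# Hypothetical skeleton: 64 points on a unit circle, i.e. a torus-like ring.
n = 64
ring = [
    (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n), 0.0)
    for k in range(n)
]
radii = [0.2] * n  # assumed constant thickness along the skeleton

on_skeleton = medial_sdf((1.0, 0.0, 0.0), ring, radii)  # inside the tube
in_hole = medial_sdf((0.0, 0.0, 0.0), ring, radii)      # centre of the hole
```

Because the implicit value depends only on distance to the skeleton, the point at the ring's centre stays outside the shape, so the hole survives regardless of how finely the surface is later sampled.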
Our third contribution, ShapeWords, bridges explicit 3D geometry and 2D diffusion models by learning a mapping from input 3D objects to the text embedding space used in text-to-image generators. This enables image synthesis to be guided jointly by textual descriptions and the target object’s shape, ensuring that the resulting images reflect the style and the silhouettes of the shape. A single scalar modulates guidance strength, allowing users to trade shape fidelity in the generated image against stylistic diversity. Because the guidance operates in the latent space, it remains fully compatible with rich, compositional prompts. Empirical evaluation shows that this soft, latent-space guidance achieves geometric adherence with the input shape on par with depth-conditioned methods, while avoiding view-dependent depth maps and preserving the full expressiveness of text prompts.
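One plausible way to picture a single scalar modulating guidance strength is a linear blend between a plain text embedding and a shape-aware offset, as in the minimal sketch below. The abstract does not specify the actual formulation, so the additive form, the function name `blend_embeddings`, and the toy three-dimensional vectors are assumptions made purely for illustration.

```python
def blend_embeddings(text_emb, shape_delta, strength):
    """Add a shape-derived offset to a text embedding, scaled by `strength`.

    Assumed convention: strength = 0 keeps the text-only embedding,
    strength = 1 applies the full shape-aware offset.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    return [t + strength * d for t, d in zip(text_emb, shape_delta)]

# Toy vectors standing in for real embeddings (hypothetical values).
text_emb = [0.5, -1.0, 2.0]
shape_delta = [1.0, 0.5, -2.0]

text_only = blend_embeddings(text_emb, shape_delta, 0.0)
halfway = blend_embeddings(text_emb, shape_delta, 0.5)
full_shape = blend_embeddings(text_emb, shape_delta, 1.0)
```

Under this assumed form, intermediate values of `strength` move the conditioning vector smoothly between the text-only and fully shape-guided endpoints, which mirrors the fidelity-versus-diversity trade-off described above.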
Altogether, the three methods provide concrete algorithms for incorporating part-level, skeletal, and embedding-based structure into neural implicit and diffusion frameworks. The resulting models uphold topological constraints during reconstruction and generation, extend geometric conditioning to 2D imagery, and maintain the high-fidelity visual detail expected from modern neural pipelines.
Type
Dissertation (Open Access)
Date
2025-09