Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

Subhransu Maji

Subject Categories

Artificial Intelligence and Robotics | Computer Sciences


In this thesis, we present a simple and effective architecture called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs generalize classical orderless texture-based image models such as bag-of-visual-words and Fisher vector representations. However, unlike prior work, they can be trained in an end-to-end manner. In our experiments, we demonstrate that these representations generalize well to novel domains through fine-tuning and achieve excellent results on fine-grained, texture, and scene recognition tasks. The visualization of fine-tuned convolutional filters shows that the models are able to capture highly localized attributes. We present a texture synthesis framework that allows us to visualize the pre-images of fine-grained categories and the invariances that are captured by these models. To enhance the discriminative power of the B-CNN representations, we investigate normalization techniques for rescaling the importance of individual features during aggregation. Spectral normalization scales the spectrum of the covariance matrix obtained after bilinear pooling and offers a significant improvement. However, the computation involves singular value decomposition, which is not computationally efficient on modern GPUs. We present an iteration-based approximation of the matrix square root, along with its gradients, to speed up the computation, and study its effect on fine-tuning deep neural networks. Another approach is democratic aggregation, which aims to equalize the contributions of individual feature vectors to the final pooled image descriptor. This achieves a comparable improvement and, unlike spectral normalization, can be approximated in a low-dimensional embedding. This approach is therefore well suited to aggregating higher-dimensional features. We demonstrate that the two approaches are closely related, and we discuss the trade-off between performance and efficiency.
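The two core operations named in the abstract, bilinear pooling and an iterative matrix square root, can be illustrated with a small sketch. This is not the thesis implementation: the function names are hypothetical, the features are toy vectors rather than CNN activations, and the Newton-Schulz iteration is assumed here as one common SVD-free scheme, since this page does not specify which iteration the thesis uses.

```python
# Illustrative sketch only. Names and the choice of Newton-Schulz are
# assumptions, not details taken from the thesis.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def bilinear_pool(feats_a, feats_b):
    """Average the outer products of paired local feature vectors.

    feats_a[l] and feats_b[l] stand in for the outputs of the two feature
    extractors at location l. Averaging over locations discards their
    order, which is what makes the descriptor orderless.
    """
    n = len(feats_a)
    da, db = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * db for _ in range(da)]
    for fa, fb in zip(feats_a, feats_b):
        for i in range(da):
            for j in range(db):
                pooled[i][j] += fa[i] * fb[j] / n
    return pooled


def sqrt_newton_schulz(mat, num_iters=20):
    """Approximate the square root of a symmetric positive definite matrix.

    Uses only matrix multiplications (no SVD). The input is first divided
    by its Frobenius norm so that the iteration converges.
    """
    n = len(mat)
    norm = sum(v * v for row in mat for v in row) ** 0.5
    y = [[v / norm for v in row] for row in mat]
    z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(num_iters):
        zy = matmul(z, y)
        # t = (3I - ZY) / 2
        t = [[(3.0 * (i == j) - zy[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        y = matmul(y, t)
        z = matmul(t, z)
    scale = norm ** 0.5  # undo the initial normalization
    return [[v * scale for v in row] for row in y]


# Toy example: three 2-D local features, with the same features on both sides.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = bilinear_pool(feats, feats)
root = sqrt_newton_schulz(pooled)
```

Pooling the same features on both sides yields a symmetric positive semidefinite matrix, which is the setting in which a matrix square root is defined. The toy features above span both dimensions, so the pooled matrix is positive definite and multiplying `root` by itself recovers `pooled` up to floating-point error.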


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.