When data is uncertain, an important class of queries requires query answers to be returned only if their existence probabilities pass a threshold. I start by optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections through reducing the dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing query plans using a dynamic, per-tuple approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques; the dynamic query planning achieves over 50% performance gains over a state-of-the-art threshold query optimizer in most cases and comes very close to optimal planning in all cases.
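To make the selection step concrete, here is a minimal sketch (not the thesis's actual implementation) of threshold selection over a one-dimensional Gaussian uncertain attribute: a cheap k-sigma bounding filter prunes tuples before the exact one-dimensional integration, illustrating the filter-then-integrate idea. All names and parameters are illustrative.

```python
import math

def in_range_prob(mu, sigma, lo, hi):
    """Exact P(lo <= X <= hi) for X ~ N(mu, sigma^2) via the Gaussian CDF."""
    s = sigma * math.sqrt(2.0)
    return 0.5 * (math.erf((hi - mu) / s) - math.erf((lo - mu) / s))

def threshold_select(tuples, lo, hi, tau, k=6.0):
    """Return ids of tuples whose existence probability reaches tau.
    A cheap k-sigma bounding filter prunes tuples before the exact
    one-dimensional integration."""
    out = []
    for tid, mu, sigma in tuples:
        # Fast filter: if the k-sigma interval misses [lo, hi], the
        # probability is (numerically) below any practical threshold.
        if mu + k * sigma < lo or mu - k * sigma > hi:
            continue
        if in_range_prob(mu, sigma, lo, hi) >= tau:
            out.append(tid)
    return out

rows = [("t1", 5.0, 1.0), ("t2", 50.0, 1.0), ("t3", 9.5, 2.0)]
print(threshold_select(rows, 0.0, 10.0, 0.5))  # → ['t1', 't3']; t2 pruned by the filter
```

The filter never consults the integral, so tuples whose support cannot overlap the selection range are rejected at constant cost.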

Next, I address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. I define the formal semantics of array operations on uncertain data, involving both value uncertainty within individual tuples and position uncertainty about where a tuple belongs in an array given uncertain dimension attributes. I then propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds query overhead by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines by one to two orders of magnitude while incurring only a small storage overhead.
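As an illustration of the replica-placement idea (a simplified sketch, not the thesis's actual scheme), the following places a tuple with a Gaussian-uncertain dimension attribute in its most likely array cell and replicates it into at most a few neighboring cells that carry non-trivial probability mass; all names, cutoffs, and the replica bound are assumptions made for this toy.

```python
import math

def cell_prob(mu, sigma, cell, width):
    """P that a N(mu, sigma^2) dimension value falls in cell
    [cell*width, (cell+1)*width)."""
    s = sigma * math.sqrt(2.0)
    lo, hi = cell * width, (cell + 1) * width
    return 0.5 * (math.erf((hi - mu) / s) - math.erf((lo - mu) / s))

def place_with_replicas(mu, sigma, width, cutoff=0.05, max_replicas=3):
    """Return the cells that should store (copies of) the tuple: the most
    likely cell plus up to max_replicas cells whose probability mass
    exceeds cutoff, bounding per-tuple storage overhead."""
    home = int(mu // width)
    # Only cells within a few sigma of the mean carry non-trivial mass.
    lo_cell = int((mu - 4 * sigma) // width)
    hi_cell = int((mu + 4 * sigma) // width)
    scored = sorted(((cell_prob(mu, sigma, c, width), c)
                     for c in range(lo_cell, hi_cell + 1)), reverse=True)
    cells = [c for p, c in scored if p >= cutoff][: 1 + max_replicas]
    return sorted(set(cells) | {home})

print(place_with_replicas(mu=10.2, sigma=0.1, width=1.0))  # tight tuple: one cell
print(place_with_replicas(mu=10.2, sigma=2.0, width=1.0))  # wide tuple: a few replicas
```

A low-variance tuple lands in a single cell, while only high-variance tuples pay for replicas, which is what keeps the storage overhead small.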

Finally, to bridge the widening gap between the rapid growth of data and the limited human ability to comprehend it, and to help users retrieve high-value content more effectively, I propose to build interactive data exploration as a new database service using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (an 18x improvement for 4-dimensional workloads) while achieving the efficiency desired for interactive exploration (2 to 5 seconds per iteration).
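The explore-by-example loop can be sketched as follows. To stay dependency-free, this toy uses a perceptron in place of the SVM, with uncertainty sampling (labeling the pool point closest to the current boundary) standing in for the example-acquisition step; the oracle, pool, and all parameters are illustrative, not the thesis's system.

```python
import random

def perceptron(labeled, epochs=200, lr=0.1):
    """Fit a linear boundary w.x + b = 0 on labeled 2-D points
    (a stand-in for the SVM used in the actual framework)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x, y), lab in labeled:
            s = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if s != lab:
                w[0] += lr * lab * x; w[1] += lr * lab * y; b += lr * lab
    return w, b

def explore_by_example(pool, oracle, seeds, rounds=10):
    """Active-learning loop: repeatedly label the pool point closest to
    the current boundary (most uncertain), then refit the model."""
    labeled = [(p, oracle(p)) for p in seeds]
    unlabeled = [p for p in pool if p not in seeds]
    for _ in range(rounds):
        w, b = perceptron(labeled)
        nxt = min(unlabeled, key=lambda p: abs(w[0] * p[0] + w[1] * p[1] + b))
        unlabeled.remove(nxt)
        labeled.append((nxt, oracle(nxt)))  # the "user" labels one example
    return perceptron(labeled)

# Hypothetical user interest: points with x + y > 1.
oracle = lambda p: 1 if p[0] + p[1] > 1 else -1
random.seed(0)
pool = [(random.random() * 2, random.random() * 2) for _ in range(200)]
w, b = explore_by_example(pool, oracle, seeds=[pool[0], pool[1]])
acc = sum((1 if w[0] * x + w[1] * y + b > 0 else -1) == oracle((x, y))
          for x, y in pool) / len(pool)
print(f"accuracy after 10 interaction rounds: {acc:.2f}")
```

Each iteration asks the user for exactly one label, so the cost of example acquisition per iteration, not total data size, dominates interactivity.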

Specifically, in observational data, researchers lack control over the data-generation process. This results in a fundamental challenge: **the presence of confounder variables** (i.e., variables that affect both treatment and outcome). Such variables, when not statistically adjusted for, can result in biased causal estimates. When confounder variables are observed, many methods can be used to adjust for their effect. However, in most real-world observational data sets, accurately measuring all potential confounders is far from feasible, so important confounders are likely to remain unobserved. The central idea of this thesis is to **explicitly account for unobserved confounders by inferring their values using a predictive model**.

This thesis presents three main contributions at the intersection of machine learning and causal estimation. First, we present one of the earliest applications of causal estimation methods from the social sciences to social media platforms, answering three causal questions. Second, we present a novel generative model for estimating ordinal variables with distant supervision; we also apply this model to data from the US Twitter user population and discover variation in behavior among users from different age groups. Third, we characterize the behavior of an effect restoration model based on graphical models through theoretical analysis and simulation studies, and we apply this effect restoration model with predictive models to account for unobserved confounders.

Here we report on our progress toward MP arithmetic libraries on the GPU in four areas: (1) large-integer addition, subtraction, and multiplication; (2) high-performance modular multiplication and modular exponentiation (the key operations for cryptographic algorithms) across generations of GPUs; (3) high-precision floating-point addition, subtraction, multiplication, division, and square root; and (4) parallel short division, which we prove is asymptotically optimal on EREW and CREW PRAMs.
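For instance, modular exponentiation reduces to a chain of modular multiplications via square-and-multiply, which is why fast modular multiplication is the critical primitive for the cryptographic workloads above. A scalar Python sketch of the standard right-to-left algorithm (illustrative only; the GPU libraries operate on multi-word integers in parallel):

```python
def mod_exp(base, exp, mod):
    """Right-to-left square-and-multiply modular exponentiation:
    one modular squaring per exponent bit, plus one modular
    multiplication per set bit."""
    result = 1
    base %= mod
    while exp:
        if exp & 1:                    # multiply in the current bit
            result = result * base % mod
        base = base * base % mod       # square for the next bit
        exp >>= 1
    return result

print(mod_exp(7, 560, 561))  # → 1 (561 is a Carmichael number)
```

The loop performs O(log exp) modular multiplications, so speeding up the inner multiplication speeds up the whole exponentiation proportionally.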

I will first present a deep learning technique that generates 3D shapes by translating an input sketch into the parameters of a predefined procedural model. The inferred procedural model parameters then yield multiple, detailed output shapes that resemble the user's input sketch. At the heart of our approach is a deep convolutional network trained to map sketches to procedural model parameters.

Procedural models are not always readily available, so I will present a deep learning algorithm capable of automatically learning parametric models of shape families from 3D model collections. The parametric models are built from dense point correspondences between shapes. To compute correspondences, we propose a probabilistic graphical model that learns a collection of deformable templates describing a shape family. The probabilistic model is backed by a deep convolutional network that learns surface point descriptors such that accurate point correspondences are established between shapes.

Based on the estimated shape correspondences, I will introduce a probabilistic generative model that hierarchically captures statistical relationships among corresponding surface point positions and parts, as well as their existence in the input shapes. A deep learning procedure is used to capture these hierarchical relationships. The resulting generative model produces control point arrangements that drive shape synthesis by combining and deforming parts from the input collection.

With these new data-driven modeling algorithms, I hope to significantly shorten the design cycle of 3D products and make the creation of detail-rich visual content easy for casual modelers.

We start by helping users find data. In the real world, public data is everywhere on the Web, but it is scattered across many sources. To address this, we extract a prototype relational knowledge base. We start from the most basic binary mapping relationships (sometimes called bridge tables) between entities on the web. These mapping relationships facilitate many data transformation applications, such as auto-correct, auto-fill, and auto-join.
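As a toy illustration of how a mined bridge table enables auto-fill (the table and values here are invented for the example, not drawn from our knowledge base):

```python
# A toy "bridge table" mapping entities between two representations,
# as might be mined from web tables (names and pairs are illustrative).
bridge = {
    "CA": "California",
    "NY": "New York",
    "MA": "Massachusetts",
}

def auto_fill(column, mapping):
    """Fill a derived column by looking each value up in the bridge
    table; unknown values are left blank rather than guessed."""
    return [mapping.get(v, "") for v in column]

print(auto_fill(["MA", "CA", "TX"], bridge))  # → ['Massachusetts', 'California', '']
```

Auto-correct and auto-join follow the same pattern, using the mapping to normalize values or to equate keys across two tables.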

After finding the data, we help users explore it. When users issue exploratory queries, the results may contain too many items, so the system should present a small subset of representative and diverse items rather than all of them. This is known as the query result diversification problem. We propose the RC-Index, which helps solve the diversification problem by significantly reducing the number of items the database must retrieve to form a diverse set of a desired size. It is nearly an order of magnitude faster than the state of the art and has a good performance guarantee, improving the ease of querying databases.
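For intuition, a standard greedy max-min heuristic for result diversification looks as follows; the RC-Index's contribution is orthogonal, reducing how many items such a procedure must fetch from the database in the first place. Names and data are illustrative.

```python
def greedy_diverse(items, k, dist):
    """Greedy max-min (farthest-point) diversification: repeatedly add
    the item farthest from the already-chosen set. This classic greedy
    is a 2-approximation for the max-min dispersion objective."""
    chosen = [items[0]]
    while len(chosen) < k:
        nxt = max((i for i in items if i not in chosen),
                  key=lambda i: min(dist(i, c) for c in chosen))
        chosen.append(nxt)
    return chosen

points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 5)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(greedy_diverse(points, 3, euclid))  # → [(0, 0), (5.1, 5), (0, 5)]
```

Note that the near-duplicate points `(0.1, 0)` and `(5, 5)` are skipped in favor of spread-out representatives, which is exactly the behavior a diversified result set should exhibit.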

Finally, we shift our focus from data to computing capacity. Cloud computing has revolutionized data analysis, but choosing the right configuration is challenging because the common pricing mechanisms of the public cloud are complicated: users must reason about low-level resources to find the best plan for their computational tasks. To address this, we propose a new market-based framework for pricing computational tasks in the cloud and introduce agents that help users configure their personalized databases, improving the ease of use of databases in the cloud.

Our contributions include the definition of the concept of triads for conjunctive queries, a crucial tool in our analysis, and the characterization of an NP-versus-P dichotomy for the resilience problem over the class of conjunctive queries without self-joins. Moreover, this result allowed us to show dichotomies for the same class of queries for both deletion propagation with source side effects and the causal responsibility problem. We also completely characterize how the presence of functional dependencies can change the complexity of these problems.

The class of conjunctive queries with self-joins is far richer and more complicated than the self-join-free one. We therefore focus on binary queries without variable repetition, i.e., queries formed only from unary and binary relations in which each atom contains only one occurrence of any variable. For this restricted case, we identify three main query structures that determine complexity: chains, permutations, and confluences. Using these, we characterize classes of queries for which resilience is NP-complete and others for which it is in P.
