Mathematics and Statistics Department Faculty Publication Series

GEMINI: a computationally-efficient search engine for large gene expression datasets

Timothy DeFreitas, Worcester Polytechnic Institute
Hachem Saddiki, University of Massachusetts Amherst
Patrick Flaherty, University of Massachusetts - Amherst

Publication Date

2016

Journal or Book Title

BMC Bioinformatics

Abstract

Background

Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number, or meta-data tag. However, in this search paradigm the form of the query – a text-based string – is mismatched with the form of the target – a genomic profile.

Results

To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and in certain circumstances achieves an O(log n) expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that, in practice, its query time on genomic data scales as the logarithm of the number of records. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute force search time of 0.6 sec.
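The idea of a vantage-point tree search can be illustrated with a minimal sketch. This is not GEMINI's implementation: the Euclidean metric, the random choice of vantage point, and all names (`VPNode`, `build_vp_tree`, `nearest`, `brute_force_nearest`) are assumptions made for illustration. Each node stores one profile (the vantage point) and the median distance from it to the remaining profiles; profiles closer than the median go to one subtree and the rest to the other. During a query, the triangle inequality lets the search skip a subtree whenever it cannot contain a profile closer than the best one found so far.

```python
import math
import random


def euclidean(a, b):
    # Assumed metric for this sketch; any true metric works with a VP tree.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance >= radius


def build_vp_tree(points, dist=euclidean, rng=None):
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    # Pick a random vantage point and remove it from the working list.
    i = rng.randrange(len(points))
    points[i], points[-1] = points[-1], points[i]
    vantage = points.pop()
    if not points:
        return VPNode(vantage, 0.0, None, None)
    # Sort the remaining points by distance and split at the median.
    dists = sorted(((dist(vantage, p), p) for p in points), key=lambda t: t[0])
    mid = len(dists) // 2
    radius = dists[mid][0]
    inside = [p for _, p in dists[:mid]]
    outside = [p for _, p in dists[mid:]]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(root, query, dist=euclidean):
    best = [math.inf, None]  # [best distance so far, best point so far]

    def visit(node):
        if node is None:
            return
        d = dist(query, node.point)
        if d < best[0]:
            best[0], best[1] = d, node.point
        if d < node.radius:
            visit(node.inside)
            # Outside points are at least (radius - d) away from the query.
            if d + best[0] >= node.radius:
                visit(node.outside)
        else:
            visit(node.outside)
            # Inside points are at least (d - radius) away from the query.
            if d - best[0] <= node.radius:
                visit(node.inside)

    visit(root)
    return best[1], best[0]


def brute_force_nearest(points, query, dist=euclidean):
    # Linear scan over every profile, used here as the baseline.
    p = min(points, key=lambda x: dist(query, x))
    return p, dist(query, p)
```

A query passes a profile (here, a tuple of numbers) to `nearest` and gets back the closest stored profile and its distance. The brute-force scan returns the same distance but touches every record, whereas the tree search can discard whole subtrees at each level when pruning succeeds.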

Conclusions

GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information.

DOI

https://doi.org/10.1186/s12859-016-0934-8

Volume

Issue

102

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Funder

UMass SOAR Fund

Recommended Citation

DeFreitas, Timothy; Saddiki, Hachem; and Flaherty, Patrick, "GEMINI: a computationally-efficient search engine for large gene expression datasets" (2016). BMC Bioinformatics. 1207.
https://doi.org/10.1186/s12859-016-0934-8

Included in

Genomics Commons, Theory and Algorithms Commons

ScholarWorks@UMass Amherst
