Publication:
Spreadsheet Tools for Data Analysts

dc.contributor.advisorEmery D. Berger
dc.contributor.authorBarowy, Daniel W
dc.contributor.departmentUniversity of Massachusetts Amherst
dc.date2024-03-27T18:17:46.000
dc.date.accessioned2024-04-26T15:11:43Z
dc.date.available2024-04-26T15:11:43Z
dc.date.submittedSeptember
dc.date.submitted2017
dc.description.abstractSpreadsheets are a natural fit for data analysis, combining a simple data storage and presentation layer with a programming language and basic debugging tools. Because spreadsheets are accessible and flexible, they are used by both novices and experts. Consequently, spreadsheets are hugely popular, with more than 750 million copies of Microsoft Excel installed worldwide. This popularity means that spreadsheets are the most popular programming language on the planet and the de facto tool for data analysis. Nevertheless, spreadsheets do not address a number of important tasks in a typical analyst's pipeline, and their design frequently complicates them. This thesis describes three key challenges for analysts using spreadsheets. 1) Data wrangling is the process of converting or mapping data from a "raw" form into another form suitable for use with automated tools. 2) Data cleaning is the process of locating and correcting omitted or erroneous data. 3) Formula auditing is the process of finding and correcting spreadsheet program errors. These three tasks combined are estimated to occupy more than three quarters of a data analyst's time. Furthermore, errors not caught during these steps have led to catastrophically bad decisions resulting in billions of dollars in losses. Advances in automated techniques for these tasks may result in dramatic savings in both time and money. Three novel programming language-based techniques were created to address these key tasks. The first, automatic layout transformation using examples, is a program synthesis-based technique that lets spreadsheet users perform data wrangling tasks automatically, at scale, and without programming. The second, data debugging, is technique for data cleaning that combines program analysis and statistical analysis to automatically find likely data errors. The third, spatio-structural program analysis unifies positional and dependence information and finds spreadsheet errors using a kind of anomaly analysis. Each technique was implemented as an end-user tool---FlaskRelate, CheckCell, and ExceLint respectively---in the form of a point-and-click plugin for Microsoft Excel. Our evaluation demonstrates that these techniques substantially improve user efficiency. Finally, because these tools build on each other in a complementary fashion, data analysts can run data wrangling, cleaning, and formula auditing tasks together in a single analysis pipeline.
dc.description.degreeDoctor of Philosophy (PhD)
dc.description.departmentComputer Science
dc.identifier.doihttps://doi.org/10.7275/10678415.0
dc.identifier.orcidN/A
dc.identifier.urihttps://hdl.handle.net/20.500.14394/17221
dc.relation.urlhttps://scholarworks.umass.edu/cgi/viewcontent.cgi?article=2158&context=dissertations_2&unstamped=1
dc.source.statuspublished
dc.subjectdata debugging
dc.subjectspatio-structural program analysis
dc.subjectlayout transformation
dc.subjectprogram synthesis
dc.subjectdebugging
dc.subjectspreadsheet
dc.subjectProgramming Languages and Compilers
dc.titleSpreadsheet Tools for Data Analysts
dc.typeopenaccess
dc.typearticle
dc.typedissertation
digcom.contributor.authorisAuthorOfPublication|email:dbarowy@cs.umass.edu|institution:University of Massachusetts Amherst|Barowy, Daniel W
digcom.identifierdissertations_2/1045
digcom.identifier.contextkey10678415
digcom.identifier.submissionpathdissertations_2/1045
dspace.entity.typePublication
Files
Original bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
thesis.pdf
Size:
5.87 MB
Format:
Adobe Portable Document Format
Collections