Infotron’s founders work in science every day. Arie van Deursen is Professor Software Engineering. Felienne Hermans is Assistant Professor and focuses entirely on spreadsheets. In January 2013 she finished her dissertation “Analyzing and Visualizing Spreadsheets”. A short presentation about spreadsheet research and the summary of Felienne’s dissertation.

Strata Conference London, 2013

Analyzing and Visualizing Spreadsheets

Spreadsheets (mostly Excel) are used extensively in industry: they are the number one tool for financial analysis and are also prevalent in other domains, such as logistics and planning. Their flexibility and immediate feedback make them easy to use for non-programmers.

Difficult to analyze and adapt

But as easy as spreadsheets are to build, so difficult can they be to analyze and adapt. This dissertation aims at developing methods to support spreadsheet users to understand, update and improve spreadsheets. We took our inspiration for such methods from software engineering, as this field is specialized in the analysis of data and calculations. In this dissertation, we have looked at four different aspects of spreadsheets: metadata, structure, formulas and data.

Metadata

In spreadsheets, users can mix real data, such as ‘Mekelweg 4’ with so-called metadata: data that is used to describe data, such as ‘Address’. When you want to know what a spreadsheet does, this metadata can be more relevant than the actual data. This is why we have developed an approach to extract this metadata.

Class diagrams

We detect metadata by looking at common places where it is usually found, such as in the first row or column. These common places are called ‘patterns’ and we defined four different version of them. If a pattern in found, it can subsequently be transformed into a class diagram, which describes the spreadsheet metadata in a structured way. These class diagrams can be used to re-implement, refine or improve the corresponding spreadsheets.

Supporting end-users

Initially, these diagrams were intended to be used by (non end-user) software developers. However, it was when applying this research in practice that I found that representing spreadsheets with diagrams could also support end-users.

Structure

Zoom in on a single worksheet and discover which formulas cause worksheet dependencies.

This is why subsequently, we started to investigate the information needs of endusers. In this research, we found that the most important information need of spreadsheet users is to get a quick overview of how the worksheets of a spreadsheet are connected.

Dependencies

Dataflow diagrams proved to be a very viable method of visualizing the dependencies between worksheets, as shown by a case study we performed at Robeco, a large Dutch investment bank. As one of the subjects in our case study said:“The global view reminds me of the plan I had when building this spreadsheet. Normally, I just go through the worksheets from left to right, but that is not the logical way.”

Formulas

When we observed spreadsheet users interacting with these visualizations, we found that they used them not only for understanding spreadsheets, but also for assessing their quality.

Inter-worksheet smells

This is why we formalized this formerly informal assessment of spreadsheet quality by introducing inter-worksheet smells. Since we asserted that worksheets and their connections strongly resemble classes and their relations, we decided to use inter-class code smells as our starting point.

The subsequent evaluation showed that annotating the dataflow diagrams helps spreadsheet users to judge spreadsheet quality even more. One of the subjects stated, upon being confronted with the annotated dataflow diagram: “I should really take some time to improve these formulas, since this is already confusing for me, so it must be terrible for others”.

Individual formulas

While researching the topic of smells, we noticed that users also felt the need to assess individual formulas for maintainability. This is why we expanded our idea of spreadsheet smells into the realm of spreadsheet formulas. Again we took our inspiration from existing code smells, yet this time at the intra-class level. With this approach, we were able to locate complex formulas in a spreadsheet. Users in the subsequent case study with 10 professional users, again conducted at Robeco, confirmed that the selected formulas were indeed the least maintainable and could be improved: “I understand why this formula is selected by your system, it is quite long.” Another subject, when looking at a formula that referred to no less than 17 ranges said “this formula is a real puzzle”.

Data copy paste

Finally, we researched the applicability of clone detection within spreadsheets. Cloning, or copy-pasting, can diminish understandability, since it can be unclear for the user what the real source of data is. Furthermote cloning can be errorprone, since in the case of an update, this update has to be performed on all copies.

Clone detection approach

When validating our clone detection approach, we found that both exact clones (those matching 100%) and near-miss clones occur in spreadsheets and that both pose different threats. Exact clones mainly impact the user’s perspective of spreadsheets (“I did not know these values were copied from hat source”), while near-miss clones really cause trouble (“this value should have been updated months ago”).

Source of errors

We learned that cloning can be a source of errors, as even creators of the spreadsheets did not know all relations by heart. In one of the case studies, our approach was able to detect and repair severe errors in spreadsheets used for the inventory of the south-Dutch foodbank. This has shown that clone detection in spreadsheets can solve real-life spreadsheet problems.

Conclusion

We found that methods from software engineering can be applied to spreadsheets very well, and that these methods support end-users in working with spreadsheets.

Author, copyright: Felienne Hermans 2012

“Analyzing and Visualizing spreadsheets”, ISBN 9789088915680