Visualising data science workflows to support third-party notebook comprehension: an empirical study

Understanding third-party code is hard. Data science code is typically the product of an exploratory, iterative process, and when this is compounded by other issues, such as poor documentation and bad coding practices, comprehension becomes harder still.

In our work, we investigate how workflow visualisation can help data scientists understand third-party data science notebooks.

1) First, we provide empirical evidence for the existence of non-linearity in data science notebooks.

2) Second, we propose a graph-based visualisation method to elucidate the implicit workflow information in data science code implemented in Jupyter notebooks. For each node, the visualisation provides information such as the data science step it corresponds to and its rationale. The goal of the visualisation is to assist data scientists in navigating the so-called ‘garden of forking paths’ present in data science workflows.

3) Third, we conducted an empirical study to evaluate the method (implemented as a Jupyter plugin called MARG) and its effect on data science notebook comprehension. The study shows that the proposed visualisation helps users get an overview of the notebook and significantly improves comprehension. Our result on the effort required to complete the comprehension task, measured in terms of time taken, is inconclusive.

4) Finally, we provide further insights into the difficulties faced and strategies used during notebook comprehension through a comprehensive qualitative analysis. We also present a System Usability Scale (SUS) analysis of the visualisation plugin.
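For readers unfamiliar with SUS, the sketch below shows how a single participant's score is computed under the standard scoring rule for the ten-item questionnaire: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function name `sus_score` and the example responses are illustrative assumptions, not data or code from the study.

```python
def sus_score(responses):
    """Compute one participant's SUS score (0-100) from ten 1-5 Likert responses.

    Standard SUS scoring; `responses` is ordered by item number (item 1 first).
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS expects exactly ten responses, each between 1 and 5")

    total = 0
    for index, response in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += response - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - response
    return total * 2.5


# Hypothetical participant who answers "neutral" (3) to every item.
neutral_score = sus_score([3] * 10)
```

A study-level SUS figure is usually reported as the mean of these per-participant scores.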

Data Science Workflow Visualisation: MARG
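To make the idea of an implicit workflow graph concrete, here is a minimal sketch of how forking paths could be recovered from notebook cells. It is not MARG's actual implementation: it assumes that cells are given as plain source strings in notebook order, that an edge runs from the most recent cell assigning a variable to any later cell that reads it, and that a "fork" is a cell with more than one downstream cell. The function names are hypothetical.

```python
import ast
from collections import defaultdict


def cell_names(source):
    """Return (defined, used) variable names found in one cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return defined, used


def build_workflow_graph(cells):
    """Map each cell index to the set of later cells that read a variable it assigned.

    Simplification: names are tracked per cell, not per statement, so a name
    that a cell both assigns and reads is treated as read from an earlier cell
    whenever an earlier cell also assigned it.
    """
    last_definition = {}  # variable name -> index of the latest cell assigning it
    edges = defaultdict(set)
    for index, source in enumerate(cells):
        defined, used = cell_names(source)
        for name in used:
            if name in last_definition:
                edges[last_definition[name]].add(index)
        for name in defined:
            last_definition[name] = index
    return edges


def forking_cells(edges):
    """Return the cells whose output feeds more than one later cell."""
    return sorted(cell for cell, successors in edges.items() if len(successors) > 1)


# Toy notebook: cell 1 produces `clean`, which two alternative analyses consume.
cells = [
    "raw = [3, 1, 2]",
    "clean = sorted(raw)",
    "total = sum(clean)",
    "largest = max(clean)",
]
graph = build_workflow_graph(cells)
forks = forking_cells(graph)
```

In this toy notebook, cell 1 is the only fork, because both cell 2 and cell 3 branch off from `clean`. A real tool would also need to handle imports, function definitions, in-place mutation and out-of-order execution, all of which this sketch ignores.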

Our qualitative analysis shows that i) users’ perceived comprehension does not match their performance; ii) users prefer information that is less cognitively demanding; iii) the majority of participants skimmed the notebook as a first step towards comprehending it; and iv) users face several obstacles during data science notebook comprehension, the top two being missing narrative text and code-related issues.

We also discuss the challenges and opportunities for future research on supporting data scientists in their development work.

Read more at “Visualising data science workflows to support third-party notebook comprehension: an empirical study”. (Open Access)