Data scientists are not software engineers. Yet, in the normal course of data science work, a data scientist must be able to write and use software. Do you know of any great physicists who are bad at math? I don’t. It is universally accepted that great physicists must also be good at writing and using equations to describe their observations. In the same vein, I believe that great data scientists must also be good at writing and using software to implement their ideas.

Tragically, in my data science journey from the swamp to the stars, I’ve found the importance of writing good software to be regularly downplayed and the importance of drawing conclusions to be greatly exaggerated. It makes sense that the data science industry would be this way. Stakeholders never run or check the software, so the conclusions end up being their only window to the work. Even worse, most stakeholders blindly trust those conclusions, not realizing how dangerous this could be. In data science, more so than in any other software discipline, it is common to encounter the worst kind of bugs - those that don’t raise errors but instead mislead you to draw the wrong conclusions. For example:

  1. Two fields in a table are remarkably similar, except for an extremely subtle difference that affects <1% of the data. No documentation explains the difference, and you pick the wrong field for your analysis. It then turns out that the affected fraction makes up a significant portion of the data you actually care about.
  2. You train a machine learning model with many parameters, not realizing that some of the rows you included in your training set are actually duplicates of the rows in your validation set. Your model performs exceptionally well during validation and flops after deployment.
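
A quick way to catch the second pitfall before it bites is to check whether any validation rows also appear verbatim in the training set. Here is a minimal pandas sketch (the DataFrame names are just placeholders):

```python
import pandas as pd

def find_split_leakage(train: pd.DataFrame, val: pd.DataFrame) -> pd.DataFrame:
    """Return validation rows that also appear verbatim in the training set."""
    # An inner merge on all shared columns surfaces rows common to both splits.
    leaked = val.merge(train.drop_duplicates(), how="inner")
    if not leaked.empty:
        print(f"Warning: {len(leaked)} validation rows duplicate training rows")
    return leaked
```

Running a check like this right after the split costs a few lines and can save a model from looking deceptively good during validation.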

No unit test will ever catch these kinds of errors, so it is of utmost importance that, as a data scientist, I write my scripts in a way that makes these errors easy to prevent in the first place. This means writing modular code and choosing appropriate data structures to store intermediate results. You can never get enough practice. Two months ago, I wanted to learn more about data visualization in a biological context, so I chose an old course that a former colleague of mine taught in 2017 as a starting point. I ended up having to do major refactoring, and in the process, I improved my ability to eliminate redundancies. After distilling the notebooks down to the main ideas, it became abundantly clear that this project isn't much different from any of my other data projects. At its core, the project ingests data, transforms it, then exports figures. The more modular I made it, the easier it became to reuse for similar projects in the future.
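
That ingest-transform-export shape is easy to keep explicit in code. A bare-bones skeleton of the structure I mean (the paths and function names here are only illustrative):

```python
from pathlib import Path

import pandas as pd

def ingest(path: Path) -> pd.DataFrame:
    """Read the raw data from disk."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean and reshape the raw table into a plot-ready form."""
    return raw.dropna().rename(columns=str.lower)

def export_figures(df: pd.DataFrame, out: Path) -> None:
    """Plot the transformed data and save the figure."""
    ax = df.plot()
    ax.get_figure().savefig(out)

def main() -> None:
    df = transform(ingest(Path("data/raw.csv")))
    export_figures(df, Path("figures/summary.png"))

if __name__ == "__main__":
    main()
```

Because each stage only talks to the next through a DataFrame, swapping out the data source or the figure style means touching exactly one function.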

As a data scientist, I want to be the best that I can be at my craft, and that means sharpening my axe whenever I see it needs to be sharpened. With my then-upcoming interviews looming, January seemed like the best time to finally learn data structures and algorithms at the level of a computer scientist, beyond just using lists, dictionaries, pandas dataframes, and simple control structures. For sure, this wasn't the hottest thing on my bucket list, but having made it over the hill, I can say with certainty that it was well worth it.

What motivated me to set aside other projects and finally study data structures and algorithms is that during an interview at a small gene therapy startup, I was asked a dynamic programming question. Prior to that interview, the most technical questions I encountered during interviews were SQL questions or simple LeetCode-style questions like reversing an array, so getting a pure programming question at that level of difficulty caught me completely off guard.

After that, in preparation for my interview with Thermo-Fisher, I perused Glassdoor to see if there was anything in particular I should study up on. I found a review saying that all of the questions they asked could be found by Googling the top 50 programming interview questions, so that's exactly what I did. I worked my way down the list, giving each question legitimate effort. I looked up the solutions on GeeksforGeeks when I got seriously stuck, which was especially helpful when writing my own implementation of a binary tree, yet despite having the solutions available, I seldom copied and pasted them outright. Maybe I'm OCD, but it bothers me a lot when I see functions that don't include return statements, multiple array indices named i, j, and k, and binary tree functions implemented outside of the BinaryTree class.

Having done this essential part of my learning, here are some of the skills that will translate directly to my data science work:

  1. I am vastly better at manipulating array indices. Two tricks I learned for readability: 1) if a function juggles multiple array indices, you can often eliminate some of them by pushing to and popping from a stack, and 2) when traversing 2D matrices, naming the indices row and col reads far better than i and j (both are sketched after this list).
  2. I improved my ability to implement separation of concerns. For example, in implementing print_tree(), rather than writing the inorder traversal into the print function, I found it made a lot more sense to write a recursive_inorder() function that yields each node, then have print_tree() call recursive_inorder() and print the results. That way, I can reuse recursive_inorder() for other functions, like search (see the sketch after this list). In working out the longest_common_substring() function, I similarly found it easier to have an external function yield the correct array indices rather than compute them inside the main function.
  3. I learned sorting algorithms. As a physics major, I hated memorization. I could not for the life of me remember the difference between bubble_sort and insertion_sort until I implemented them for myself. I'm not going to lie, I had a lot of fun with these mini-puzzles. For bucket_sort() I got creative and implemented a binary search to return the appropriate bucket. For radix_sort() I used a hash table to store the buckets through each iteration. For quick_sort(), implementing a random pivot was especially difficult for me, because the solution provided by GeeksforGeeks was to swap the random pivot with the last element and then run the same algorithm as pivot='last', and this felt oddly unsatisfying. I had to really analyze it by writing out the steps on an example array before I could implement the solution correctly (a generic version of the idea is sketched after this list). Troubleshooting this included making sure I wrote informative print statements so I could see each intermediate swap as it happened.
  4. I broke some bad habits.
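
To make the first point concrete, here are two tiny illustrations (generic examples, not code from my repository): a 2D traversal named with row and col, and a stack replacing per-level index bookkeeping.

```python
def matrix_max(matrix: list[list[int]]) -> int:
    """Return the largest value in a 2D matrix."""
    best = matrix[0][0]
    # row/col make it obvious which dimension each loop walks.
    for row in range(len(matrix)):
        for col in range(len(matrix[row])):
            best = max(best, matrix[row][col])
    return best

def flatten(nested: list) -> list:
    """Flatten an arbitrarily nested list without juggling per-level indices."""
    stack, flat = [nested], []
    while stack:
        item = stack.pop()
        if isinstance(item, list):
            stack.extend(reversed(item))  # push children instead of tracking i, j, k
        else:
            flat.append(item)
    return flat
```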
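
And here is a rough sketch of the tree example from the second point, with the traversal kept out of the printing (this shows the general pattern, not the exact code in my repository):

```python
class Node:
    """A single binary tree node."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

class BinaryTree:
    def __init__(self, root=None):
        self.root = root

    def recursive_inorder(self, node):
        """Yield node values in inorder (left, node, right)."""
        if node is not None:
            yield from self.recursive_inorder(node.left)
            yield node.value
            yield from self.recursive_inorder(node.right)

    def print_tree(self):
        """Printing only formats output; the traversal order lives elsewhere."""
        print(" ".join(str(v) for v in self.recursive_inorder(self.root)))

    def search(self, target):
        """Reuse the same traversal to look for a value."""
        return any(v == target for v in self.recursive_inorder(self.root))
```

Because recursive_inorder() is a generator, search() stops consuming it as soon as the target turns up.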
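
For anyone curious about the random-pivot trick from the third point, the swap-to-the-end approach boils down to roughly this (a generic sketch, not lifted from my repository):

```python
import random

def quick_sort(arr, low=0, high=None):
    """In-place quicksort with a randomly chosen pivot."""
    if high is None:
        high = len(arr) - 1
    if low < high:
        # Swap a random pivot into the last slot, then partition as if pivot='last'.
        pivot_index = random.randint(low, high)
        arr[pivot_index], arr[high] = arr[high], arr[pivot_index]
        split = partition(arr, low, high)
        quick_sort(arr, low, split - 1)
        quick_sort(arr, split + 1, high)
    return arr

def partition(arr, low, high):
    """Lomuto partition around arr[high]; return the pivot's final index."""
    pivot = arr[high]
    boundary = low - 1  # last index known to hold a value <= pivot
    for cursor in range(low, high):
        if arr[cursor] <= pivot:
            boundary += 1
            arr[boundary], arr[cursor] = arr[cursor], arr[boundary]
    arr[boundary + 1], arr[high] = arr[high], arr[boundary + 1]
    return boundary + 1
```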

Ironically, for all that effort, I wasn’t actually asked any programming questions by either Thermo-Fisher or Chan Zuckerberg Biohub. Instead, the data scientist at Thermo-Fisher git cloned one of my Metis projects and had me walk him through my code. The data engineer at Chan Zuckerberg was more interested in a high level understanding of my past work with databases and what I would do in different scenarios to implement a file system. Even then, I still think learning data structures and algorithms is worth it and will ultimately lead to me being a better data scientist.

The repository for my interview practice is found here. Like with most of my projects, it comes with an MIT License, so feel free to use it for any purpose.

Disclaimer: Studying programming interview questions does NOT make me an expert at data structures and algorithms and is not a complete substitute for learning the same material in a formal setting. Learning is a lifelong process, and as I do more of these questions, I will continue to update my knowledge.
