One of the more common approaches that a freshly minted data scientist takes when given a project is to immediately fire up their programming framework of choice and apply one of the fancier algorithms they know.
However, this tendency diminishes as you look at the habits of more experienced data scientists – why is that?
A More Big-Picture Approach
First, there’s nothing wrong with experimenting with a wide range of approaches when you are first exposed to a new data set. After all, there isn’t necessarily one true method of getting your bearings when figuring out what exactly the data is saying, at least at a high level.
However, some would argue that applying high-buzzword-content algorithms at the immediate outset of a project isn’t the best way to go. Sure, it seems like you’re making progress, and sometimes you get really cool graphs – but how do you know whether that graph actually means anything?
Asking Initial Questions
This is where the distinction between rookie and experienced data scientists starts to come into focus. A more experienced data scientist will take a more circumspect approach when first digging into a problem – for example, what do we know about the precision and accuracy of the data? What’s our general comfort with the cleanliness of the data? Are some variables more prone to error than others, and within what order of magnitude is this error?
Asking questions like these (and perhaps slowing down a bit) can have a major impact on avoiding early local minima – or, even worse, overfitting a model to an artifact of the data that wouldn’t have been that hard to avoid.
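Those first questions about missing values, outliers, and the rough magnitude of error can be turned into a quick sanity check before any modeling begins. Here is a minimal sketch using only the standard library; the list-of-dicts input shape and the `price` column are hypothetical, purely for illustration:

```python
import math
import statistics

def sanity_check(records, column):
    """Summarize missingness and rough error scale for one variable.

    `records` is a list of dicts (one per row); `column` names the
    variable to inspect. Returns a small summary dict.
    """
    values = [r.get(column) for r in records]
    missing = sum(1 for v in values if v is None)
    numeric = [v for v in values if isinstance(v, (int, float))]
    mean = statistics.mean(numeric)
    stdev = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
    # The order of magnitude of the spread gives a feel for the error scale.
    magnitude = int(math.floor(math.log10(stdev))) if stdev > 0 else None
    # Flag values more than 3 standard deviations from the mean.
    outliers = [v for v in numeric if stdev and abs(v - mean) > 3 * stdev]
    return {
        "missing": missing,
        "mean": mean,
        "stdev": stdev,
        "error_magnitude": magnitude,
        "outliers": outliers,
    }

rows = [{"price": 10.0}, {"price": 11.5}, {"price": None}, {"price": 9.8}]
print(sanity_check(rows, "price"))
```

A few minutes spent reading a summary like this per variable is usually enough to decide which columns deserve the circumspect treatment described above.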
All About The Presentation
Additionally, something that a lot of newer data scientists might not quite adequately appreciate yet is the disproportionate importance of giving a quality presentation of their findings, and of tying those findings to real-world business objectives.
In other words, it’s great if you analyzed many terabytes of data with a clever deep learning algorithm, but no one really cares about that – it’s much more important if you can convey what your results actually mean.
Your data may hold tremendous amounts of potential value, but not an ounce of value can be created unless insights are uncovered and translated into actions or business outcomes. – Brent Dykes
It’s easy to get so caught up in the data wrangling process that you forget to leave enough time to put together your analysis presentation – however, this is much less likely to happen amongst the more experienced data scientists, who understand that perception is often more important than reality.
Finally, more experienced data scientists realize that more and/or better data will beat fancy algorithms almost every time.
Oftentimes, the initial data set we’re working with doesn’t represent all of the data that is reasonably available to us, and by directing a few thoughtful questions to the data provider, it might not be that difficult to eliminate potential data cleanliness issues or errors right away.
Limitations of the Data
Asking careful questions about known data issues is a trait common amongst more experienced data scientists; newer data scientists are generally more gung-ho and just want to start digging into the data ASAP.
That enthusiasm is great, but without a basic understanding of the qualities of the data set, many a disappointed data scientist has frustratingly found themselves back at square one.
Practically, this generally means that the more experienced people will start out with a pretty basic visualization routine, just to get a feel for what they’re dealing with. Simple graphs of the data, even in Excel, can quickly lead to the discovery of high-sensitivity data issues that the data provider wasn’t even aware of, but could quickly fix or adjust for.
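The "basic visualization routine" doesn't need to be fancy – even a crude text histogram can surface a suspicious spike in seconds. Here is a minimal sketch, standing in for that first quick Excel-style chart; the `readings` values are made-up illustration data:

```python
from collections import Counter

def quick_histogram(values, bins=5):
    """Return a crude text histogram – a first look at a variable's shape."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against all-equal values
    # Assign each value to a bin, clamping the max value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    lines = []
    for b in range(bins):
        left = lo + b * width  # left edge of this bin
        lines.append(f"{left:8.2f} | " + "#" * counts.get(b, 0))
    return "\n".join(lines)

readings = [3.1, 3.3, 3.2, 3.4, 3.2, 9.9, 3.3, 3.1]  # one suspicious spike
print(quick_histogram(readings))
```

One isolated bar far from the rest of the distribution is exactly the kind of high-sensitivity issue worth flagging to the data provider before any modeling starts.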