Working as a data scientist is a fun and rewarding job: it means obtaining insights from data and applying those insights to solve practical, real-world problems. Data science is often considered an interdisciplinary field combining technology, mathematics, and business domain knowledge. So what is it actually like to work as a data scientist?
In this blog, we hear from Ying Yap, Data Scientist at Data Mettle, as she highlights some of the key aspects of a data scientist’s role.
__________________________________________________________________________________
COMMUNICATION AND COLLABORATION
Data science usually requires teamwork between the data scientist and business domain experts. The tools and solutions a data scientist creates ultimately exist to solve a business problem or meet a business need. Thus, a project typically begins with communication and collaboration with the relevant teams in the business. It is the data scientist’s job to gather information from stakeholders about their goals and requirements, and to understand where and how the data science model will be used, how it fits in with overall business strategies, and what business value it provides. Being able to see the wider picture of the project then helps the data scientist tailor the right solution.
A data scientist might also collaborate with data collection teams, to learn how the data is collected and so understand it better, and with data engineers and database support teams, who are critical in helping a data scientist acquire data and navigate large databases. Additionally, embedding a data science model within the company’s IT infrastructure will usually require coordinating with IT support. Successful projects generally involve communication with a range of technology and business teams.
PROGRAMMING
A great deal of a data scientist’s time is spent on programming (and debugging and debugging and debugging!). Once the business problem and project goals are understood, the next step is to dive into the data. Python and R are popular programming languages, and SQL is often used when pulling data from a database. The first steps are generally to investigate the data and pick out any trends: how much data is there, what information does it contain, is it reliable, are there missing values, and so on. This kind of exploratory work then informs how the data scientist goes about solving the problem.
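As a rough illustration of those first exploratory checks, here is a minimal sketch using only the Python standard library. The sample table, its column names, and the `summarise` helper are all made up for this example; in practice the rows would come from a database query, and a library such as pandas would usually do this work.

```python
import csv
import io
import statistics

# Hypothetical sample data standing in for a table pulled from a database.
RAW_CSV = """customer_id,age,monthly_spend,region
1,34,120.50,north
2,,80.00,south
3,45,,north
4,29,95.25,
5,51,300.00,south
"""

NUMERIC_COLUMNS = ("age", "monthly_spend")  # assumed for this example


def summarise(rows):
    """Return the row count, missing values per column, and numeric means."""
    columns = list(rows[0].keys())
    # An empty string marks a missing value in this toy dataset.
    missing = {col: sum(1 for r in rows if r[col] == "") for col in columns}
    means = {}
    for col in NUMERIC_COLUMNS:
        values = [float(r[col]) for r in rows if r[col] != ""]
        means[col] = statistics.mean(values)
    return {
        "n_rows": len(rows),
        "columns": columns,
        "missing": missing,
        "means": means,
    }


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
summary = summarise(rows)
print(summary)
```

Even a summary this small answers several of the questions above: how many rows there are, which columns exist, and where values are missing.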
Raw data often needs to be cleaned or transformed before useful insights can be gleaned from it. This may involve, for example, removing outdated and incorrect data, handling outliers and missing values, dealing with imbalanced classes, and wrangling the data into a format that is usable. Data cleaning is something that a data scientist comes back to over and over again to ensure a high quality dataset. After all, the model will only be as good as the data.
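Two of the cleaning steps mentioned above, handling missing values and handling outliers, can be sketched as follows. This is only one possible approach, chosen for illustration: missing values are filled with the median, and outliers are clipped using Tukey's fences (1.5 times the interquartile range). The function names and the sample values are hypothetical, and the right treatment always depends on the data and the business context.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers(values, k=1.5):
    """Clip values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


# Hypothetical monthly spend figures with one gap and one suspicious entry.
spend = [120.5, 80.0, None, 95.25, 3000.0, 110.0, 101.0]

filled = impute_median(spend)
cleaned = clip_outliers(filled)
print(cleaned)
```

Clipping keeps the row but limits its influence; dropping the row or investigating it with the data collection team are equally valid choices in other situations.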
Depending on the goal of the project, machine learning algorithms or statistical techniques are applied to the data. These could be supervised learning methods (e.g., Linear/Logistic Regression, Gradient Boosted Trees, Support Vector Machines, Neural Networks) or unsupervised learning methods (e.g., Principal Component Analysis, clustering algorithms, anomaly detection methods). Natural Language Processing algorithms are used when dealing with textual data to solve problems such as machine translation, sentiment analysis, and text classification. Not all problems require complex methods: sometimes a simple solution combined with high quality data is sufficient. When designing a solution, a data scientist may test out different algorithms, optimise parameter choices, and validate the model through various testing methodologies. A good model is one that is able to generalise to new, unseen data, and doesn’t just work on the training dataset. Above all, no matter the level of sophistication, a good model is one that is useful to the business!
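To make the idea of generalisation concrete, here is a small sketch of hold-out validation: a model is fitted on one part of the data and scored on a part it has never seen. Everything in it is an assumption made for illustration, including the synthetic data, the 80/20 split, and the hand-written least-squares line, which stands in for whatever algorithm a real project would use.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: y = 2x + 1 plus Gaussian noise (seeded for repeatability).
rng = random.Random(0)
xs = [float(i) for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

# Shuffle the indices, then hold out 20% of the points for testing.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:80], indices[80:]

train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Fit on the training split only, then score on both splits.
slope, intercept = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, slope, intercept)
test_mse = mse(test_x, test_y, slope, intercept)
print(f"slope={slope:.3f} train_mse={train_mse:.3f} test_mse={test_mse:.3f}")
```

If the error on the held-out points is much worse than on the training points, that is a sign the model has memorised the training data rather than learned something that carries over.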
RESEARCH AND ANALYSIS
A core aspect of the job is to analyse and investigate the data. This is an iterative process of forming hypotheses about the data, testing them out, and interpreting the results. Are the right features included in the model? Is a portion of the data skewing the results? Is this algorithm right for this problem? As a data scientist, you will spend a lot of time ruminating on the problem and devising ways to answer the questions you pose.
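One way to test a hypothesis such as "these two groups really do differ" is a permutation test, sketched below. It is offered as an example of the hypothesise-test-interpret loop rather than as the method for every question; the function name, the group labels, and the numbers are invented for illustration.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two customer segments.
north = [12.1, 11.8, 12.5, 12.9, 12.3, 11.9, 12.7, 12.4]
south = [10.2, 10.9, 10.5, 10.1, 10.8, 10.4, 10.6, 10.3]

p_value = permutation_p_value(north, south)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone; interpreting what that difference means for the business is still the data scientist's job.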
Because data science is a very broad field, there are always new technologies to learn and new domain knowledge to acquire. Picking up new skills is part of the process. You might need to learn how to code a new algorithm, to study the mathematics behind it, or to learn how to use a software system you have not used before. Understanding the data and algorithms you use is key to building a good model that can add value to the business. The good news is that the more knowledge you pick up, the more tools you will have at your disposal for the next project.
SUMMARY
In a nutshell, the job of the data scientist is to solve a business problem using the data and any tools and algorithms available. Because every set of data has different characteristics, every project is a new problem to tackle, and it never gets boring or routine. It isn’t all about the technical challenge, though. Adopting a business-focused approach and understanding exactly what the business requires is vital to the success of a project. Being able to communicate technical details to a non-technical audience, deliver presentations effectively, and collaborate with other teams is also part of the job. It is always incredibly rewarding to be able to solve real-world problems using scientific methodologies!