8 Skills You Should Learn to Be a Data Scientist
7 Types of Data Scientist
1) Data Scientist as Statistician
This is data analysis in the traditional sense. The field of statistics has always been about number crunching, and a strong statistical base lets you branch out into a number of data science fields. Hypothesis testing, confidence intervals, Analysis of Variance (ANOVA), data visualization and quantitative research are some of the core skills statisticians possess, and they can be built on to gain expertise in the specific data science fields described in the following sections of this article.
Statistics knowledge, when combined with domain knowledge (such as marketing, risk or actuarial science), is the ideal combination to land a statistician’s work profile. Statisticians can develop statistical models from big data analysis, carry out experimental design, and apply sampling theory, clustering and predictive modelling to available data to inform future corporate actions.
2) Data Scientists Vs Data Engineers
A data engineer’s role is very different from that of a data scientist. Data engineers are responsible for designing, building and managing the data handling infrastructure that analyses and processes the information an organization captures, in line with its requirements. They are also responsible for keeping that infrastructure running smoothly. They need to work closely with data scientists, IT managers and other business leaders to translate raw data into actionable insights that give the organization a competitive edge.
3) Data Scientist as Machine Learning Scientist
Computer systems around the world are increasingly being equipped with artificial intelligence and decision making capabilities. Many rely on models such as neural networks that learn adaptively, meaning they can be trained over time to make consistent decisions when given similar inputs. Machine learning scientists develop the algorithms used to suggest products, set pricing strategies, extract patterns from big data and, most importantly, forecast demand (which in turn supports better inventory management, stronger supply chain networks, etc.).
4) Data Scientist as Business Analytics Practitioner
Businesses make the final use of all the number crunching done by data science professionals. As a business analytics professional, it is important to have business acumen as well as know your numbers. Business analysis is as much an art as a science, and one cannot afford to be driven entirely by either business acumen or insights from data analysis. These professionals sit between the front end decision making teams and the back end analysts.
They work on crucial decision making tasks such as ROI analysis, ROI optimization, dashboard design, performance metric determination, high level database design, etc.
5) Data Scientist as Software Programming Analyst
Unlike traditional coders, this class of professionals has a knack for number crunching through programming. Needless to say, they are adept at logical thinking and, as a result, take to new programming languages like ducks to water. A number of languages and tools, such as R, Python, Apache Hive, Pig and Hadoop, support data analytics and visualization.
Software programming analysts have the programming skills to automate routine big data tasks and reduce computing time. They are also required to handle databases and the associated ETL (Extract, Transform, Load) tools, which extract data, transform it by applying business logic, and load it into a target store that feeds visual summaries such as charts, histograms and interactive dashboards.
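To make the extract, transform and load steps concrete, here is a minimal sketch using only the Python standard library. The sample CSV text, the `orders` table and the function names are all made up for illustration; a real pipeline would read from actual files or services and would likely use a dedicated ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,region,amount
1,north,120.50
2,South,80.00
3,north,
4,south,45.25
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount and normalize the region name."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # assumed business rule: skip orders with a missing amount
        cleaned.append(
            (int(row["order_id"]), row["region"].strip().lower(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text=RAW_CSV):
    """Run all three steps against an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    for region, total in conn.execute(query):
        print(region, total)
```

The per-region totals produced by the final query are the kind of summary a chart or dashboard would then display.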
6) Data Scientist as Digital Analytic Consultant
A data scientist needs to be able to define the data in accordance with the business problem, and for this he or she needs to know the business end of the spectrum.
7) Data Scientist as Quality Analyst
The Quality Analyst role has long been associated with statistical process control in the manufacturing industry. This position has been included here to emphasize the importance of data science in core industries. Assembly lines involved in mass production generate large data sets that must be analysed to maintain quality control and meet minimum performance standards. The job has evolved over the years, with new analytic tools that data scientists use to prepare interactive visualizations that serve as key inputs in decision making across teams such as management, business, marketing, sales and customer service.
8 Data Skills to Get You Hired As a Data Scientist
This is the core set of 8 data science competencies you should develop:
Basic Tools: No matter what type of company you’re interviewing with, you’re likely going to be expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL. If you use Python, there are a number of popular Python libraries for data science and machine learning.
Basic Statistics: At least a basic understanding of statistics is vital as a data scientist. An interviewer once told me that many of the people he interviewed couldn’t even provide the correct definition of a p-value. You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc. Think back to your basic stats class! As with machine learning, one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics is important at all company types, but especially at data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and design / evaluate experiments.
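Since the p-value comes up so often, a small worked example may help. A p-value is the probability, assuming the null hypothesis is true, of seeing a result at least as extreme as the one actually observed. The sketch below computes an exact two-sided p-value for a coin-flip experiment; the function name and the 60-heads-in-100-flips scenario are made up for illustration, and "at least as extreme" is taken to mean "no more likely than the observed count", which is one common convention among several.

```python
from math import comb


def binomial_two_sided_p_value(successes, trials, p_null=0.5):
    """Exact two-sided p-value for a binomial experiment.

    Sums the null-hypothesis probability of every outcome that is no more
    likely than the observed one.
    """

    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # A small relative tolerance keeps floating-point ties on the "as extreme" side.
    total = sum(
        pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Is a coin that lands heads 60 times in 100 flips consistent with a fair coin?
    print(binomial_two_sided_p_value(60, 100))
```

A small p-value says the data would be surprising if the null hypothesis were true. It does not give the probability that the null hypothesis is true, which is the confusion interviewers tend to probe for.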
Machine Learning: If you’re at a large company with huge amounts of data, or working at a company where the product itself is especially data-driven, you’ll probably want to be familiar with machine learning methods. This can mean things like k-nearest neighbors, random forests and ensemble methods: all of the machine learning buzzwords. It’s true that a lot of these techniques can be implemented using R or Python libraries, so it’s not necessarily a dealbreaker if you’re not the world’s leading expert on how the algorithms work. More important is to understand the broad strokes and really understand when it is appropriate to use different techniques. See popular Machine Learning Interview Questions.
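To give a sense of the broad strokes, here is a bare-bones sketch of k-nearest neighbors classification in plain Python. The function name and the toy points are made up for illustration; in practice you would reach for a library implementation, which handles ties, feature scaling and efficient neighbor search far better than this.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels), key=lambda pair: dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Two made-up clusters, one around (1, 1) and one around (5, 5).
    points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(points, labels, (1.1, 1.0)))
    print(knn_predict(points, labels, (5.1, 5.0)))
```

Even this small version shows when the method is a poor fit: every prediction scans the full training set, and the result depends heavily on how the features are scaled.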
Multivariable Calculus and Linear Algebra: You may in fact be asked to derive some of the machine learning or statistics results you employ elsewhere in your interview. Even if you’re not, your interviewer may ask you some basic multivariable calculus or linear algebra questions, since they form the basis of a lot of these techniques. You may wonder why a data scientist would need to understand this stuff if there are a bunch of out of the box implementations in sklearn or R. The answer is that at a certain point, it can become worth it for a data science team to build out their own implementations in house. Understanding these concepts is most important at companies where the product is defined by the data and small improvements in predictive performance or algorithm optimization can lead to huge wins for the company.
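As one example of the kind of result you might be asked to derive, consider fitting a straight line by least squares. Setting the partial derivatives of the sum of squared errors with respect to the slope and intercept to zero gives a closed-form solution, which the sketch below implements directly. The function name and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Minimizing sum((y - slope * x - intercept) ** 2) and setting both partial
    derivatives to zero yields:
        slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
        intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # undefined (division by zero) if all x values are equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Points that lie exactly on y = 2x + 1.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

Being able to connect the code back to the calculus is what lets a team adapt or optimize a method when the out of the box version falls short.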
“Data scientist” is often used as a blanket title to describe jobs that are drastically different.
Data Munging: Oftentimes, the data you’re analyzing is going to be messy and difficult to work with. Because of this, it’s really important to know how to deal with imperfections in data. Some examples of data imperfections include missing values, inconsistent string formatting (e.g., ‘New York’ versus ‘new york’ versus ‘ny’), and inconsistent date formatting (‘2014-01-01’ vs. ‘01/01/2014’, unix time vs. timestamps, etc.). This will be most important at small companies where you’re an early data hire, or data-driven companies where the product is not data-related (particularly because the latter has often grown quickly with not much attention to data cleanliness), but this skill is important for everyone to have.
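The imperfections listed above can be handled with a few small normalization helpers. The sketch below uses only the Python standard library. The alias table, the function names and the assumption that slash-separated dates are month/day/year are all illustrative choices, not rules that hold for every data set.

```python
from datetime import datetime, timezone

# Hypothetical alias table; a real one would be built from the data at hand.
CITY_ALIASES = {"ny": "New York", "nyc": "New York", "new york": "New York"}


def normalize_city(raw):
    """Map variants like 'new york' or 'ny' to one canonical spelling; None if missing."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if not key:
        return None  # treat empty strings as missing values
    return CITY_ALIASES.get(key, raw.strip().title())


def normalize_date(raw):
    """Return an ISO date string (YYYY-MM-DD), or None if the value cannot be parsed.

    Accepts '2014-01-01', '01/01/2014' (assumed to be MM/DD/YYYY) and Unix
    timestamps in seconds (interpreted as UTC).
    """
    text = str(raw).strip()
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    try:
        return datetime.fromtimestamp(float(text), tz=timezone.utc).date().isoformat()
    except (ValueError, OverflowError, OSError):
        return None


if __name__ == "__main__":
    for city in ["New York", " new york ", "ny", ""]:
        print(repr(city), "->", normalize_city(city))
    for value in ["2014-01-01", "01/01/2014", 1388534400, "not a date"]:
        print(repr(value), "->", normalize_date(value))
```

Returning `None` for anything unparseable, rather than guessing, keeps missing and malformed values visible so they can be counted and dealt with explicitly.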
Data Visualization & Communication: Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time or companies where data scientists are viewed as people who help others make data-driven decisions. When it comes to communicating, this means describing your findings, or the way techniques work, to both technical and non-technical audiences. On the visualization side, it can be immensely helpful to be familiar with data visualization tools like ggplot and d3.js. It is important to be familiar not just with the tools necessary to visualize data, but also with the principles behind visually encoding data and communicating information.
Software Engineering: If you’re interviewing at a smaller company and are one of the first data science hires, it can be important to have a strong software engineering background. You’ll be responsible for handling a lot of data logging, and potentially the development of data-driven products.
Thinking Like A Data Scientist: Companies want to see that you’re a (data-driven) problem solver. That is, at some point during your interview process, you’ll probably be asked about some high level problem – for example, about a test the company may want to run or a data-driven product it may want to develop. It’s important to think about what things are important, and what things aren’t. How should you, as the data scientist, interact with the engineers and product managers? What methods should you use? When do approximations make sense?
Data science is still nascent and ill-defined as a field. Getting a job is as much about finding a company whose needs match your skills as it is developing those skills. This writing is based on my own firsthand experiences – I’d love to hear if you’ve had similar (or contrasting) experiences during your own process.