With over 6 billion (and counting) devices connected to the internet right now, as much as 2.5 million terabytes of data are generated every single day. By 2020, millions more devices are expected to be connected, bringing the estimate to around 30 million terabytes of data every day.
Statistics, Machine Learning, Data Science, or Analytics – whatever you call it, this discipline has been on the rise over the last quarter century, primarily owing to growing data collection capabilities and an exponential increase in computational power. The field draws from the pool of engineers, mathematicians, computer scientists, and statisticians, and increasingly demands a multi-faceted approach for successful execution. In fact, hardly any branch of engineering, science, or business in any industry is untouched by analytics. Perhaps you, too, are interested in becoming a data scientist, or already are one.
What is Data Science?
Data science combines multiple fields, including statistics, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data. Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights.
Data science encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis. Analytic applications and data scientists can then review the results to uncover patterns and enable business leaders to draw informed insights.
Interesting Facts About Data Science
There are many things people know about data science. Still, there may be several others that might surprise you.
1. Data is Never Clean
A survey of data scientists found that they spend most of their time massaging data rather than mining or modeling it. Cleaning and organizing data takes about 60% of their time, and collecting data sets comes second at 19%. Together, that means data scientists spend around 80% of their time preparing and managing data for analysis.
Analytics without real data is a mere collection of hypotheses and theories. Data helps test them and find the one that suits the end use at hand. However, in the real world, data is never clean. Even in organizations that have had well-established data science centers for decades, data isn’t clean. Apart from missing or wrong values, one of the biggest problems is joining multiple datasets into a coherent whole. And it’s not intentional. Data stores are designed around, and tightly integrated with, the front-end software and the users generating the data, and they are often created independently of one another. The data scientist enters the scene quite late and is often just a “taker” of data as-is, not part of the design.
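The problems described above (missing values, wrong values, and datasets that do not join cleanly) can be sketched in a few lines of plain Python. This is only an illustration: the records, the field names, and the `normalize_key` and `audit` helpers are all hypothetical, and the rule that a negative amount is "wrong" is an assumption made for the example.

```python
# Hypothetical records from two independently built systems (illustrative only).
orders = [
    {"order_id": 1, "customer_id": "C001", "amount": "120.50"},
    {"order_id": 2, "customer_id": "c002 ", "amount": ""},       # untidy key, missing amount
    {"order_id": 3, "customer_id": "C003", "amount": "-15.00"},  # suspicious negative amount
    {"order_id": 4, "customer_id": "C999", "amount": "80.00"},   # no matching customer
]
customers = {"C001": "Asha", "C002": "Ben", "C003": "Chen"}


def normalize_key(raw):
    """Trim whitespace and upper-case a customer key so both systems agree."""
    return raw.strip().upper()


def audit(orders, customers):
    """Return cleaned, joined rows plus a list of (order_id, problem) pairs."""
    clean, problems = [], []
    for row in orders:
        key = normalize_key(row["customer_id"])
        if key not in customers:
            problems.append((row["order_id"], "unmatched customer"))
            continue
        if row["amount"] == "":
            problems.append((row["order_id"], "missing amount"))
            continue
        amount = float(row["amount"])
        if amount < 0:  # assumed business rule for this sketch
            problems.append((row["order_id"], "negative amount"))
            continue
        clean.append(
            {"order_id": row["order_id"], "customer": customers[key], "amount": amount}
        )
    return clean, problems


clean_rows, problems = audit(orders, customers)
print(clean_rows)
print(problems)
```

Only one of the four rows survives the checks here, which is roughly what the 80% figure feels like in practice.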
Dirty data takes one or more forms, such as missing values, wrong values, and datasets that do not join cleanly.
2. There is No Fully Automated Data Science
Since data is not clean and requires quite a lot of processing, there is no ready set of scripts to run or buttons to push to develop an analytic model. Each dataset and each problem is different. There is no substitute for exploring data, testing models, and validating the results against business sense and domain experts. Depending on the problem and your prior experience, you may get your hands less dirty, but dirty they will get. The only exception is if you get data in a specific format and do the same thing over and over, but that already sounds boring.
3. Big Data is Just a Tool
With the hype around Big Data getting louder every day, I won’t blame you for being enamored of the idea. However, the key thing to remember is that Big Data is just a collection of tools for working with a large volume of data in a reasonable time on commodity-grade computer hardware. Sound analytic problem design, modeling best practices, and the scrutinizing eye of an astute analyst can’t be replaced by Big Data.
That is not to say that competency in Big Data techniques isn’t handy. It is, all the more so since the world is moving towards Big Data and there may not be any “Small” Data left in a couple of years. But tools will come and go; only your machine learning experience will persist. Big Data is like giving a police officer an AK-47 rifle instead of a flintlock carbine. Sure, a better tool is preferable to an inferior one, but being trained in policing is more important than the rifle.
4. Data Scientists and Data Analysts are Not the Same
This is a common myth among people who have only a superficial idea of data science. In reality, the work of data scientists and data analysts is quite different. Whereas data analysts work on finding trends and analyzing the data, data scientists work on finding the cause of a trend and forecasting upcoming trends. As data science is a new field, it is inevitable that certain misconceptions pop up.
However, it is worth noting that the two work in tandem. They complement each other and work toward a common goal. Now let us check out some of the basic differences between the two.
|Data Scientist|Data Analyst|
|---|---|
|Discovers unexplored questions that may need an answer|Uses existing information to get workable data on existing questions|
|Skillset: algorithms, data mining, programming, database management, data analysis, machine learning, predictive analysis|Skillset: data mining, modeling, programming, statistical analysis, database management, data analysis|
|Estimates unknown data|Works with known data sets|
|Chooses to address the business problems that would have maximum effect|Addresses the business problems assigned to them|
|Works at a macro level|Works at a micro level|
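As a deliberately simplified sketch of the "known data" versus "unknown data" distinction, the snippet below first describes the trend in observed figures and then extrapolates it one step ahead. The monthly sales numbers and the `fit_line` helper are hypothetical, and a straight-line fit is just one assumed way to forecast, not a statement of how either role actually works.

```python
# Hypothetical monthly sales figures (illustrative only).
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Describing the known data: how fast have sales been growing?
slope, intercept = fit_line(months, sales)
print(f"Observed trend: {slope:.1f} units per month")

# Estimating the unknown: what might month 7 look like if the trend holds?
forecast_month_7 = slope * 7 + intercept
print(f"Forecast for month 7: {forecast_month_7:.1f}")
```

The forecast is only as good as the assumption that the past trend continues, which is where questions about the cause of the trend come in.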
5. Data Science is Not Just Excel Sheets
This one may seem surprising, but many people believe that the life of a data scientist revolves around Excel sheets.
This is anything but true. As mentioned before, data science is a vast field whose basic focus is on reaching the correct and intended outcome. To get that outcome, data science professionals fight tooth and nail. They use different data analytics techniques, SQL queries, statistical analysis, predictive analysis, and whatnot.
They do work with Excel sheets, but that is just a small part of their work.
There was once a time when Excel sheets played a major role in making analyses and arriving at conclusions using formulae and calculations. At present, with the easy availability of programming tools like Python and R, most data scientists spend a great portion of their time coding rather than working in Excel sheets.
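To give a feel for what "coding rather than Excel" can look like, here is a minimal sketch that totals revenue per region, similar in spirit to a spreadsheet SUMIF. The sheet contents, the column names, and the `revenue_by_region` helper are made up for illustration, and the data is inlined as CSV text so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sheet contents, inlined as CSV text (illustrative only).
SHEET = """region,revenue
North,1200
South,800
North,300
East,500
South,200
"""


def revenue_by_region(csv_text):
    """Sum revenue per region, similar in spirit to a spreadsheet SUMIF."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += float(row["revenue"])
    return dict(totals)


totals = revenue_by_region(SHEET)
print(totals)
```

Unlike a formula typed into a cell, the same function can be rerun unchanged on next month's export.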
6. More Data Does Not Always Mean More Accuracy
More data doesn’t mean more insight or more value addition. Using smart data is the key.
Suppose we have a dataset with exactly the minimum amount of data needed to make a correct analysis. This would be an ideal dataset. Now if we add some more data, the entire dataset will need to be rebuilt to take the new data into account. While rebuilding it, we will need to clean the new data and spend time understanding how it deviates from the existing set, if at all.
Even after the new data is cleaned and merged into the existing ideal dataset, there is a possibility that some new element is still dirty but unidentified. This will degrade the final result or analysis.
In this case, less data was surely better than more data.
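The scenario above can be sketched with a toy example. The readings are invented, and the sketch assumes that two Fahrenheit values slipped into a later batch of Celsius readings without being caught during cleaning; `true_value` is likewise an assumed ground truth, chosen only so the error can be measured.

```python
from statistics import mean

# Hypothetical sensor readings in degrees Celsius (illustrative only).
trusted = [20.1, 19.9, 20.0, 20.2, 19.8]

# A later batch, assumed to contain two Fahrenheit readings that slipped
# through cleaning unnoticed.
new_batch = [20.0, 68.2, 19.9, 68.0]

true_value = 20.0  # assumed ground truth for this sketch

# Error of the estimate from the small, clean dataset.
error_small = abs(mean(trusted) - true_value)

# Error of the estimate after merging in the larger, partly dirty batch.
error_large = abs(mean(trusted + new_batch) - true_value)

print(f"5 clean readings: error = {error_small:.2f}")
print(f"9 mixed readings: error = {error_large:.2f}")
```

The larger dataset gives the worse estimate here because the extra rows carry an unidentified unit error, not because more data is bad in itself.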
7. The Data Science Field Has Different Roles, Not Just Data Scientists
Many people associate data science with data scientists only, ignoring the other prominent roles belonging to the field.
Data science includes all of these –
- Data Engineers – They are responsible for managing data infrastructure throughout the data science lifecycle. Basic skills include programming tools like Python, NoSQL databases, and big data tools like Hadoop.
- Data Analysts – They find answers to questions by working through the available data using appropriate tools. Basic skills include programming, data visualization, statistics, mathematics, and of course data analysis.
- Data Scientists – They work on big data, analyze it, and then communicate the findings through reports and presentations. Basic skills include statistics, mathematics, programming, data visualization, SQL, Hadoop, and machine learning.
8. Data Science is Not Meant Only for Large Organizations
Many businesses believe that data science is meant only for big organizations with high-class infrastructure.
This belief stems from a mistaken notion of what data science is. Data science is not defined by machines, heavy tools, or the size of the available resources. Rather, it is made up of data, statistics, analysis, programming, presentation, and some smart people who know how to make the best of data and add value to the organization. It has nothing to do with whether the organization is big or small.
A data scientist needs to arrive at a result that benefits the company. And no one really cares as to what tools and techniques have been used to achieve that result.
Coming to infrastructure, all that is needed is a computing device, the internet, and some tools that help through the data science life cycle. There are a number of open-source tools available online that can be downloaded to get the ball rolling.
9. Popular Data Science and Machine Learning Languages
A majority of the survey respondents, 75%, use Python for data science work either always or quite frequently. This statistic is in line with the popularity that Python has garnered in recent years, and the trend looks set to continue in data science through 2020 and 2021.
R, on the other hand, comes second with 27% of users. R is a scripting language that is powerful yet simple, and data scientists who do not come from a software engineering background find it easy to adopt for their machine learning and data science work.
Data science is becoming indispensable with the data explosion in almost every field, and it offers good career opportunities. Considering data science as a career can be a wise decision for anyone who enjoys problem-solving and has data empathy.