Introduction
Anaconda's 2022 State of Data Science report gathers demographic information and domain-related issues and trends in the data science community, and collects insights into the big questions that are top of mind within the data community.
This year's report focuses on more actionable issues and concerns within the data science, machine learning, and artificial intelligence industries rather than on COVID themes.
I have pulled out key points from the report to highlight the concerns and issues data scientists faced in 2022.
The raw data from the State of Data Science survey is made available to the public in the spirit of democratising data.
Here is the link: https://www.anaconda.com/state-of-data-science-report-2022
About The Survey
A total of 3,493 people from 133 countries and regions took part in the online survey.
Respondents fell into three tracks: students, academics, and those working in commercial environments.
Some of the questions were universal while others were unique to each cohort.
The sample skewed towards younger generations.
Demographics of Data Scientists
Who is an average data scientist?
The majority (66%) of the 3,493 respondents were either Millennials or Generation Z.
Millennials (47%) are between 26–41 years old and Generation Z (19%) are between 18–25 years old.
The gender diversity split is 23% women, 76% men and 1% non-binary.
Work Environment
Where do data scientists work?
55% of respondents work for companies with 1,000 employees or fewer.
About a quarter (24%) of commercial respondents worked in a data science department, while 22% worked in R&D and 18% worked in IT.
Sometimes there was an entire data-focused team; other times data scientists were embedded in other departments.
Work Routine
How do data scientists spend their time?
Data professionals need diverse technical and non-technical skills across a variety of tasks.
Respondents reported spending almost 37% of their time on data preparation and cleansing.
Making data actionable and answering critical questions requires data visualisation and demonstrating data’s value through reporting and presentation. Together these took 29% of their time: data visualisation (13%) plus reporting and presentation (16%).
Working with models (selection, training, and deployment) takes about 26% of their time.
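As a quick sanity check on the figures above, the short sketch below adds up the reported shares. The dictionary name and the category labels are illustrative only, and the percentages are the rounded values quoted in this article, so the totals are approximate.

```python
# Approximate shares of working time quoted in this article (percent).
# The labels and the variable names are illustrative, not taken from the report's data files.
time_share = {
    "data preparation and cleansing": 37,
    "data visualisation": 13,
    "reporting and presentation": 16,
    "model selection, training and deployment": 26,
}

# Visualisation and reporting are grouped together in the text above.
communication = time_share["data visualisation"] + time_share["reporting and presentation"]

# Total time covered by the categories listed here, and what is left over.
accounted_for = sum(time_share.values())
remainder = 100 - accounted_for

print(f"Visualisation + reporting: {communication}%")
print(f"Accounted for by these categories: {accounted_for}%")
print(f"Not itemised in this summary: {remainder}%")
```

The categories listed here do not sum to 100%, so some of the respondents' time falls under tasks not itemised in this summary.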
Concerns of data scientists
What concerns data scientists in the process of deploying models?
Meeting IT/Info Sec standards (34%), securing data connectivity (28%), and re-coding models from Python/R to another language (26%) are some of the obstacles respondents face when deploying a machine learning model in production. Interestingly, most models (41%) are put into production via an on-premises local server.
40% of commercial-track respondents indicated that their organisation scaled back their open-source software usage in the past year due to concerns around security, and most respondents selected “security vulnerabilities” as the biggest challenge in the open-source community today. To ensure their open-source supply chains and packages are secure and meet enterprise security standards, organisations are using a variety of measures and tools.
What are the concerns of data scientists using machine learning?
40% of survey respondents indicated that their organisation had implemented or planned to implement steps to ensure fairness and mitigate bias over the next year. The most common step is evaluating data collection methods against internally set standards (30.61%), followed by manually assessing data sets for fairness and bias (24.84%).
23.64% of respondents indicated that their organisations have no standards for, or have not implemented measures or tools to address, fairness and bias mitigation in data sets and models, and 14.89% aren’t sure about their organisations’ efforts.
What knowledge gaps do data scientists see in data science?
Engineering skills (38%), probability and statistics (33%), business knowledge (32%), communication skills (31%), and big data management (29%) are the five skills respondents most often cited as missing in the data science/ML areas of their organisations.
What resources are lacking for data scientists who want to learn and develop their skills?
Tailored learning paths (51%), hands-on projects (48%), and mentorship opportunities (48%) are the top tools and resources respondents feel are lacking for data scientists who want to learn and develop their skills.
Why would a data scientist leave their job?
1 in 4 data scientists will leave if they do not get more access to professional training and development opportunities. More flexibility with work hours also matters, as a lot of data science work is deep work.
How concerned are data scientists about a skills shortage?
The majority of respondents indicated that their organisations are concerned about the impact of a talent shortage.
What tools are being used by data scientists?
In the survey sample, 46.83% of commercial respondents indicated that their organisations currently use Anaconda. Other popular tools that organisations are currently using include GitHub (44.94%), RStudio (33.33%), Stack Overflow (31.57%), and Tableau (30.65%).