We all look forward to the benefits of Artificial Intelligence, but how often do we discuss the challenges? In today’s era, businesses and governments treat AI as a goose that lays golden eggs. Every day we witness breakthrough research and new achievements unlocked by AI. It seems we are riding a new wave of revolution, one set to automate boring, mundane tasks and much more. From reading about driverless cars to seeing one on the road, the technology has come a long way. However, this is just one side of the story, and several limitations are often overshadowed by the technology’s promises. AI is growing rapidly and research is still ongoing in several fields, so these limitations deserve our attention. We need to identify the challenges and understand how we can contribute to the technology’s development. Moreover, business leaders need to equip themselves with knowledge of its limitations so they can differentiate between the hype and the reality.
Challenge 1: The curse of Data Annotation
The first challenge I want to highlight is data annotation, or data labeling. A major chunk of machine learning rests on the fact that we train our computers using labeled data. In simpler terms, if I want my computer to look at an image and predict whether it contains a dog, I first need to show my computer what a dog looks like. This requires labeled images of dogs as well as several other images that don’t contain a dog.
The problem here is: who annotates the data? Data annotation requires a large human workforce, and big tech giants like Google and Facebook hire massive teams who spend hours labeling data. The irony is that we are trying to build smarter systems, yet they require substantial manual labor. That is not smart. Let’s take an example. Suppose you are building an AI-based system that can detect the damage coronavirus causes to the lungs. For that, you will need many chest X-ray images that have been labeled under professional guidance. We can’t even imagine the consequences of using improperly labeled data in this case.
The following are some solutions being explored to address the issue.
- Reinforcement Learning: The concept of RL rests on the simple fundamentals of a reward system and is commonly used to train robots. The same idea can be applied to labeling systems: a function is rewarded every time it labels correctly and receives negative reinforcement for a wrong label. A very interesting use case is a reinforcement-based recommendation system, which is rewarded whenever a recommendation-driven sale occurs.
- GANs: Generative adversarial networks (GANs) are a semi-supervised technique in which two networks compete against each other, eventually leading to refined systems. Suppose you are trying to build a GAN that can generate images of humans who don’t exist. You will need two networks: one distinguishes between fake and real images of humans, while the other tries to fool it with images that look human but aren’t. Data labeling systems built on this technique can significantly reduce human dependency by generating images that look remarkably real.
- Unsupervised Learning: This is a technique where we have unlabeled data and want our machines to learn something from it. Unsupervised learning is considered a very hard problem because, in the absence of labels, computers don’t know what they are being trained for, unlike supervised learning, where labels tell the machine what to learn. I would like to shed some light on the amazing work done by Mohammed Terry-Jack, who built an unsupervised system that takes an image as input and assigns one of nine colors (green, blue, white, yellow, red, etc.) to each pixel given its coordinates. The image produced after the clustering looks quite similar to the original.
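The pixel-clustering idea above can be sketched with plain k-means. This is a minimal illustration of my own (the "image" is a synthetic list of RGB tuples, not Terry-Jack's actual pipeline): the algorithm groups pixels by color with no labels at all, then repaints each pixel with its cluster center, reducing the image to a small palette.

```python
import random

random.seed(1)

def kmeans(points, k, iters=20):
    """Plain k-means: alternate assignment and center-update steps."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers

# Synthetic "image": pixels scattered near red and near blue
pixels = [(200 + random.randint(-20, 20), 10, 10) for _ in range(50)] + \
         [(10, 10, 200 + random.randint(-20, 20)) for _ in range(50)]

palette = kmeans(pixels, k=2)
# Repaint every pixel with its nearest palette color
recolored = [min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(p, c)))
             for p in pixels]
```

No pixel was ever labeled "red" or "blue"; the structure emerges purely from the distances between colors.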
Challenge 2: The requirement of massive training datasets
Current AI-based applications require not only labeled data but massive amounts of it. Let’s look at some popular datasets and their sizes.
These are just a handful of examples from computer vision, but there are several others. The biggest players in AI, such as Amazon, Google, and Facebook, lead because they have access to so much data. Let me explain with the same example of an AI-based system that detects COVID from X-ray images of lungs. Will a system trained on the data of US patients work on Indian patients? The answer is no. The spread of COVID varies across geographies, and several other factors have to be taken into account. Hence, to build a system that can detect COVID irrespective of geography, one has to collect a very large dataset.
- Simulated Learning Environment:
This is one technique that can reduce the required data size. It works on the principle of training in a virtual environment: systems are pre-trained in simulation on different variants of a task and then continue learning in the real world. For example, a robotic arm trained in a simulated environment to pick up paper cups can then learn in the real world how to pick up bottles. This saves time and effort because the system doesn’t have to start from scratch. The technique is in its early phases, and a lot of work remains to be done.
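The sim-to-real benefit can be shown with a deliberately tiny toy (entirely synthetic numbers of my own, not a real robotics setup): an agent tunes a single control parameter by gradient descent, and "simulation" and "reality" are cost surfaces whose optima sit close together, so a parameter pre-trained in simulation needs far fewer real-world updates than one trained from scratch.

```python
SIM_OPTIMUM = 4.8    # best parameter inside the simulator (made-up value)
REAL_OPTIMUM = 5.0   # best parameter in the real world (made-up value)

def descend(start, target, lr=0.1, tol=1e-3):
    """Minimize the cost (x - target)^2 from `start`; return result and step count."""
    x, steps = start, 0
    while abs(x - target) > tol:
        x -= lr * 2 * (x - target)   # gradient step on (x - target)^2
        steps += 1
    return x, steps

sim_param, _ = descend(0.0, SIM_OPTIMUM)               # cheap pre-training in simulation
_, steps_finetune = descend(sim_param, REAL_OPTIMUM)   # short adaptation in reality
_, steps_scratch = descend(0.0, REAL_OPTIMUM)          # expensive: reality only
```

Because the simulated optimum lands the parameter close to the real one, the fine-tuning run takes noticeably fewer "real-world" steps than learning from scratch, which is exactly the resource saving the technique promises.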
The second solution is transfer learning which I will explain later in the article.
Challenge 3: The problem of data quality
There is a very old saying: garbage in, garbage out. The statement resonates strongly in AI. If your model is fed bad data, you can’t expect good results. For example, if you build an autonomous car on a dataset where pedestrians are labeled as streets, what will happen? The car will run over pedestrians. This is not what we want. According to a survey published in Harvard Business Review, data scientists spend 80% of their time cleaning data and bringing it into a state where it can deliver results. But even after so much effort, not all defects in the data can be corrected. Data cleaning is time-consuming, and it is unique to each problem. For example, if you have access to online transaction data and are trying to solve two problems, customer churn prediction and customer segmentation, each use case will require its own data preparation. This demands a lot of manual effort, and we don’t want all our energy concentrated on data cleaning.
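To make the "unique to each problem" point concrete, here is a small sketch with hypothetical transaction records (the fields and customer names are invented for illustration): from the very same rows, churn prediction needs recency features, while segmentation needs frequency and spend, so the two preparations genuinely diverge.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, day_of_purchase, amount)
transactions = [
    ("alice", 1, 30.0), ("alice", 5, 12.0),
    ("bob",   2, 99.0),
    ("carol", 3, 15.0), ("carol", 9, 40.0), ("carol", 10, 5.0),
]
TODAY = 12  # reference day for recency

def churn_features(rows):
    """Churn prediction cares about recency: days since each customer's last purchase."""
    last_seen = {}
    for cust, day, _ in rows:
        last_seen[cust] = max(day, last_seen.get(cust, 0))
    return {c: TODAY - d for c, d in last_seen.items()}

def segmentation_features(rows):
    """Segmentation cares about behavior: purchase count and total spend."""
    agg = defaultdict(lambda: [0, 0.0])
    for cust, _, amount in rows:
        agg[cust][0] += 1        # frequency
        agg[cust][1] += amount   # monetary value
    return {c: tuple(v) for c, v in agg.items()}

churn = churn_features(transactions)          # e.g. {"alice": 7, ...}
segments = segmentation_features(transactions)  # e.g. {"alice": (2, 42.0), ...}
```

Neither preparation is reusable for the other task, which is why cleaning effort multiplies with every new use case on the same raw data.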
- Zero Communication Gap: The problem mainly occurs when the team which is responsible for data collection is not aware of the intention with which the data is being collected. Clear transparency should be maintained across the organization and the team of AI professionals should communicate the purpose of collecting data and make the data collection team aware of the consequences that occur when the model is fed with bad data.
- Feasibility Study: Before kicking off a project, some time should be allotted to the data professionals in which they can perform in-depth data quality analysis and check the feasibility. A proper report with the data errors should be prepared and it should be given back as feedback to the clients. This will not only create room for improvement but also allow cleaner data generation.
- Quality Assurance Metrics: Various metrics can be decided in advance, and scoring should be done at every data manipulation step. This ensures the data remains not only accurate but also consistent.
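As a sketch of such pre-agreed metrics (the column names, required fields, and thresholds here are my assumptions, not any standard), one could score each batch for completeness and validity after every manipulation step:

```python
def quality_report(records, required=("id", "age")):
    """Score a batch of records on pre-agreed quality metrics."""
    total = len(records)
    # Completeness: fraction of records with no missing required field
    complete = sum(1 for r in records
                   if all(r.get(k) is not None for k in required))
    # Validity: fraction of records whose age falls in a plausible range
    valid_age = sum(1 for r in records
                    if isinstance(r.get("age"), (int, float)) and 0 <= r["age"] <= 120)
    return {
        "completeness": complete / total,
        "age_validity": valid_age / total,
    }

batch = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 3, "age": 999},    # out-of-range value
]
report = quality_report(batch)
```

Running such a report after every transformation step gives the team an early signal when a manipulation silently degrades the data, instead of discovering it at model-training time.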
Challenge 4: The problem of Data bias
The work of Julia Angwin and many others who have written about how machine learning algorithms are biased against black people has shown that our datasets are biased. This may be intentional or unintentional, but it is necessary to stand up and take responsibility for the models we build. Let me explain with an example: suppose you are building a system that decides whether a given person can be issued a loan, and you choose a dataset so heavily skewed that it contains many records where men were issued loans and very few where women were. This doesn’t mean women are less capable of availing a loan than men; it is a clear, unintentional mistake in data collection. What will be the impact of such a system being deployed? Often these biases get hidden under the name of ‘advanced analytics’ or a ‘proprietary algorithm’, but this should not happen.
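One practical first step is simply to measure the skew before training on it. The sketch below uses made-up records (my own illustration): it computes the historical approval rate per group and their ratio, in the spirit of a disparate-impact check.

```python
# Synthetic historical loan records, deliberately skewed for illustration
records = [
    {"group": "men",   "approved": True},  {"group": "men",   "approved": True},
    {"group": "men",   "approved": True},  {"group": "men",   "approved": False},
    {"group": "women", "approved": True},  {"group": "women", "approved": False},
    {"group": "women", "approved": False}, {"group": "women", "approved": False},
]

def approval_rate(rows, group):
    """Fraction of applicants in `group` that were approved."""
    grp = [r for r in rows if r["group"] == group]
    return sum(r["approved"] for r in grp) / len(grp)

rate_men = approval_rate(records, "men")
rate_women = approval_rate(records, "women")
impact_ratio = rate_women / rate_men  # far below the common 0.8 rule of thumb
```

A ratio this low is a loud warning that a model fit to these records will likely reproduce the historical skew, whatever label the marketing puts on the algorithm.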
With governments becoming more interested in the technology, technologists should stay vigilant and take responsibility so that these prejudices and biases don’t get implemented.
One more thing I want to highlight here is public datasets. Many products and applications are built on commonly available datasets because they are free to use. For example, CIFAR is popularly used for object recognition. These datasets are shared across the AI community and are often used for benchmarking algorithms. If bias is present in a publicly available dataset, it can quickly be replicated and scaled.
- Become more responsible: If you are designing or building these models, think about how they can impact society and try to identify the inherent bias that might be present in the data. Since we build the models, we are accountable for them.
- Establishment of communities that work for responsible AI: A lot of work is already happening in this space, and many people are coming together to form groups that collectively work towards responsible AI. For example, the International Technology Law Association (ITechLaw), with technology attorneys and law firms from more than 70 countries among its members, has published Responsible AI: A Global Policy Framework. The framework is based on eight principles: Ethical Purpose and Societal Benefit; Accountability; Transparency and Explainability; Fairness and Non-discrimination; Safety and Reliability; Open Data and Fair Competition; Privacy; and AI and Intellectual Property. But this is not enough; more has to be done, and more people need to come forward.
So far, we have talked about the challenges that were somehow related to data. Now, let’s move on to another challenge that’s about explainability.
Challenge 5: The Interpretability Problem
Imagine your teacher gave you zero out of ten on a math test, and when you asked why, you got no answer. This is exactly the issue with black-box AI. The problem is not new, and a lot of research is ongoing in the field. It is structurally hard and is becoming harder as more and more complex models take over. An AI-based application can be very accurate and fast, but if it can’t tell you the reason behind its predictions, there is little chance it will perform well in the market. Let me explain with an example. Suppose you build a system that can detect breast cancer with 99% accuracy, and you go to a hospital seeking adoption of the system. The first question you will be asked might be, “How can your model say that this person is suffering from cancer?” People need to know the factors involved and how those factors contribute to the results. This calls for explainable AI. In use cases such as lending or criminal justice, it becomes crucial to know why a given person can’t be lent a loan, or on what basis the model classifies a given person as a criminal.
There are a few open-source Python packages for deep learning and boosting algorithms that help solve this problem. Some of them are:
- Google’s WIT: The What-If Tool (WIT) is a very handy tool that enables users to evaluate their ML models. WIT has a user-friendly GUI, so it’s easy to use. Its major attraction is that it lets you visualize inference results and gives you the freedom to change the data and see the impact.
- AIX360: IBM’s AI Explainability 360 is an open-source toolkit that helps you comprehend ML models. It contains eight state-of-the-art algorithms that can help decode prediction results.
- LIME: Local Interpretable Model-agnostic Explanations (LIME) is another popular framework used for answering the question of ‘why’ and opening up the black box. The output of LIME is a list of explanations reflecting the contribution of each feature to the prediction for a data sample. This provides local interpretability and also lets you determine which feature changes will have the most impact on the prediction.
Suppose you are building a system that looks at an image and predicts whether it is a human face. These tools will help you identify the features that contribute most to the prediction, such as the eyes, nose, and cheeks.
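The core perturbation idea behind such explanations can be hand-rolled in a few lines. This is a LIME-flavored sketch of my own, not the actual LIME library, and the model, its weights, and the feature names are all invented: we explain one prediction by zeroing out each feature and measuring how much the output moves.

```python
def model(features):
    """A fixed 'black box': weighted sum of face-detector style features."""
    weights = {"eyes": 0.5, "nose": 0.3, "cheeks": 0.1, "background": 0.05}
    return sum(weights[k] * v for k, v in features.items())

def explain(features):
    """Contribution of each feature: the prediction drop when it is zeroed out."""
    base = model(features)
    contributions = {}
    for k in features:
        perturbed = dict(features, **{k: 0.0})  # knock out one feature
        contributions[k] = base - model(perturbed)
    return contributions

sample = {"eyes": 1.0, "nose": 1.0, "cheeks": 1.0, "background": 1.0}
contribs = explain(sample)
top_feature = max(contribs, key=contribs.get)
```

Real explainers fit a local surrogate model over many such perturbations rather than one knockout per feature, but the output has the same shape: a per-feature score saying how much each input drove this particular prediction.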
Challenge 6: The issue of Generalizability
The whole idea of Artificial Intelligence is to build smart systems that can mimic the human brain and perform tasks with precision greater than, or at least equal to, that of human beings. So how far have we come in creating a replica of the human brain? I think we are not even close. The main reason is the human ability to learn from one situation and apply that learning to a different one. This is the idea of generalizability: the capability to gain knowledge from one problem and use it in a different scenario. Suppose you have an autonomous vehicle that has been well trained to run on the roads of the USA. Will the same vehicle be able to run on Indian roads? The answer is no. The vehicle has never seen Indian traffic conditions, and it is not a generalized system that can take its learning from USA traffic and apply it in India. The ability to generalize helps manage resources efficiently, since models don’t have to be re-trained. Imagine if we had such generalized systems today: we could have taken a model trained on Spanish flu data and used it in the current situation of COVID-19. How helpful it would have been for drug discovery and predicting the spread of the virus. That’s the power of a generalized system.
So, what are the solutions?
- Transfer Learning: This technique is already in use and avoids training a model from scratch. The basic idea is that you take a model pre-trained on a very large dataset and make minor tweaks to it as per your requirement. This saves both time and resources, since the model doesn’t have to learn everything; most of the work is already done. For example, to build a model that identifies truck images, you can take a model already trained on car images and make some minor changes. This needs fewer data points and gives considerably better results.
- Capsules: This concept is still in its early phases of development, and a lot of work is being done by Geoffrey Hinton in this space. The idea is to encapsulate the learning obtained from one system and embed it in a different system, so it can be used to solve a different problem. But, as I said, the technique is still in its early phases and requires a great deal of work. The basic difference between a regular network and a capsule network is that a capsule network has nested layers that retain knowledge, while in regular networks we keep stacking layers one after another.
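The transfer-learning idea above can be sketched at toy scale (everything here is synthetic and the names are my own; real transfer learning reuses deep network layers, not a two-number feature map): keep a "pre-trained" feature extractor frozen and fit only a small new head on a handful of labeled examples, instead of learning everything from scratch.

```python
def pretrained_extractor(x):
    """Stands in for frozen early layers: maps raw input to useful features."""
    return (x, x * x)

def train_head(data):
    """Fit a 2-weight linear head on frozen features in closed form
    (least squares via the 2x2 normal equations)."""
    s11 = s12 = s22 = t0 = t1 = 0.0
    for x, y in data:
        f0, f1 = pretrained_extractor(x)
        s11 += f0 * f0; s12 += f0 * f1; s22 += f1 * f1
        t0 += f0 * y;   t1 += f1 * y
    det = s11 * s22 - s12 * s12
    return ((s22 * t0 - s12 * t1) / det,
            (s11 * t1 - s12 * t0) / det)

# Three labeled examples suffice because the features are already good.
data = [(1.0, 3.0), (2.0, 8.0), (3.0, 15.0)]  # hidden rule: y = 2x + x^2
w = train_head(data)

def predict(x):
    f0, f1 = pretrained_extractor(x)
    return w[0] * f0 + w[1] * f1
```

Only the two head weights were fit; the extractor never changed. That is the resource saving transfer learning offers: the expensive representation is learned once and reused, and each new task pays only for its small head.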
To summarize, we have looked at six technical limitations, and all of them invite us to solve them for a better future. We have to identify more solutions and tackle these challenges. I hope you enjoyed the article. I am always looking for feedback, so please let me know if you have any.