Jeannette M. Wing: Fake information is a huge challenge for digital society

TechforGood | Author: Tencent Research Institute | 2020-04-08

Professor Jeannette M. Wing is a world-renowned computer scientist. She is currently the director of the Data Science Institute at Columbia University and a fellow of the Association for Computing Machinery (ACM). Her simple and clear definition of “Computational Thinking” has been widely recognized by the computer science community. She is also an active advocate of Data for Good.

In a recent interview with the Tencent Research Institute on the subject of Tech for Good, Professor Jeannette M. Wing argued that a major difficulty in Tech for Good is that most data scientists today have not received sufficient ethics training. The great technical challenge, she said, is that the flood of fake information is difficult to eradicate and will erode the foundation of trust in the digital society.

 

Q: As an advocate of Data for Good, how do you understand this notion?

A: Data for Good has two meanings for me. One is using data to solve the problems that society faces, like climate change, healthcare, energy, and social justice. Data-driven research in all these fields, like public health, biology, climate science, earth science, or social work, involves a lot of data. So, if these fields use their data to help solve society’s problems, they’re using data for good. That’s one meaning I attribute to Data for Good.

The second meaning is to use data in a responsible manner. This is where the problems of algorithmic bias and fairness come in. We want to make sure that when we collect data, we preserve the rights and privacy of the citizens whose data we collect. We want to make sure that we manage other people’s data in a responsible way. And when we analyze the data, we want to make sure that we’re drawing fair inferences about the people whose data we have analyzed.

So, when I say Data for Good, I mean both data to do good for society and using data in a responsible, fair, ethical, and privacy-preserving way.

 

Q: When using data to benefit society conflicts with using data in a responsible manner, which one do you think is more important?

A: I don’t think my two notions of Data for Good are in conflict; they’re complementary. When a conflict arises, it is not between the two senses of Data for Good.

The first sense of Data for Good is about solving societal grand challenges that no one discipline, no one person, and no one country can solve. Climate change is a global problem. The same goes for medical treatment: people everywhere in the world get cancer, not just the people of one country. If we can understand how cancer works and treat it, that will help everyone. So, that’s using data to solve societal grand challenges.

The second meaning is using data in a responsible manner, and this is where fairness, ethics, and privacy come into play, and where there can be a conflict. A canonical example is face recognition technology. Used for good by the police and law enforcement, face recognition will detect bad guys and criminals. That sounds like a good thing, but face recognition technology can also invade people’s privacy, and that, depending on the context, the culture, the politics, and the society, could be a bad thing. For instance, in Europe, people take privacy very seriously, and I think they would not be happy with face recognition systems being used in public places because it would be an invasion of people’s privacy. So, here is where the conflict is.

Technology itself is neutral; it’s the use of the technology that could be deemed either good or bad. It’s the value system that a culture or a society holds that determines what is good or bad. What is good in one country may not be good in another country; what is bad in one country may not be bad in another country. So, this is where value systems, social norms, culture, and society play a role in even defining or deciding what is good and what is bad. This is true of nuclear weapons. It is true of guns.

 

Q: If we want to treat cancer, we need personal information in order to make technological progress, and we can only develop the technology further after ensuring that personal information is used reasonably. So, can we say that using technology in a responsible way is more important than developing the technology itself?

A: I don’t think one would say that one thing is more or less important than the other. What you’re describing is actually an edge case, in the sense that we do have privacy-preserving technologies that would allow hospitals, for instance, to share patient data without revealing patient information. But it’s not true that technology can always solve this problem.
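Privacy-preserving technologies of this kind come in several forms; differential privacy is one example (the interview does not name a specific technique). Below is a minimal sketch, assuming a simple count query over hospital records; the function and data names are hypothetical.

```python
import numpy as np

def dp_count(records, epsilon=1.0, rng=None):
    """Differentially private count: add Laplace noise with scale 1/epsilon.

    A count query has sensitivity 1 (adding or removing one record changes
    the true answer by at most 1), so Laplace(1/epsilon) noise suffices.
    """
    rng = rng or np.random.default_rng()
    return len(records) + rng.laplace(scale=1.0 / epsilon)

# Hypothetical example: a hospital reports roughly how many patients match a
# condition without revealing whether any particular patient is in the data.
matching_patients = ["p01", "p02", "p03", "p04", "p05"]
print(round(dp_count(matching_patients, epsilon=0.5)))
```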

When you can’t invent a technology to solve a privacy issue, you need to put policies, guidelines, and/or best practices into place. And then there is a judgment call: whoever puts the guideline, policy, or best practice into place is weighing one thing as more important or better than another.

 

Q: You mentioned the protection of privacy and transparency at the World Artificial Intelligence Conference. How can we ensure these in real-world practice?

A: I think it depends on which property it is. Some properties like fairness can actually be formalized. There are multiple notions of fairness, and each one of them can be formalized. Once you can formalize a property, then you have the hope of automatically determining whether a machine-learning model or an algorithm is fair or not.
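For instance, one widely used formal notion, demographic parity, requires a model’s positive predictions to be statistically independent of a sensitive attribute. The formula below is an illustration rather than a definition given in the interview:

```latex
% Demographic parity: the rate of positive predictions \hat{Y} is the same
% across all groups a, b of a sensitive attribute A (e.g., gender).
P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)
  \quad \text{for all groups } a, b
```

Other formal notions, such as equalized odds, constrain error rates rather than overall positive rates, and the different definitions generally cannot all be satisfied at once.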

I think it’s important to remember that when people talk about biased machine-learning models, the bias comes from the data used to train the models. That data is usually historical data, and historical data usually reflects societal biases. So, if society has had a bias and has been making decisions with it for decades, that bias will be in the data. The biased data will then train the machine-learning model, and the model will be biased.

But what is possible is to anticipate that a machine-learning model may be biased and to try to detect whether it is biased with respect to some formal notion of fairness. Then you can fix the model, and you can collect more data. You could do all sorts of things to de-bias the model and the decision system.

Right now, we don’t have automated means of detecting whether a model is biased or not. We can only give engineers guidance: test your model on different datasets to see whether it is producing biased decisions.

If you’re getting biased decisions, then you have to fix the model or collect more training data. That’s the best we can do for now. The state of the art is that we are only now realizing that machine-learning models can be biased. That’s good; we are now aware of it. The next step is to do something about it, and ideally we’d like to do that in an automated way. Otherwise, it’s up to human beings to tediously fix a biased model. And all of this is still research.
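As a minimal sketch of the kind of manual check described here, assuming a binary classifier and a binary sensitive attribute (the names, data, and threshold are illustrative, not prescribed in the interview), one can compare positive-decision rates across groups:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two groups."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_0 = y_pred[group == 0].mean()  # positive rate for group 0
    rate_1 = y_pred[group == 1].mean()  # positive rate for group 1
    return abs(rate_0 - rate_1)

# Hypothetical model decisions for eight people, split across two groups.
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

gap = demographic_parity_gap(y_pred, group)
print(f"demographic parity gap: {gap:.2f}")
if gap > 0.2:  # the threshold here is an arbitrary illustration
    print("Possible bias: revisit the model or collect more training data.")
```

In practice, a check like this would be run on held-out data for each population the model serves, which is the “test your model on different datasets” guidance above.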

Some properties cannot be so easily formalized, like ethics. Philosophers have debated ethical questions for centuries, and no mathematical formula captures a particular ethical principle. It’s more a matter of qualitative judgment, of what social norms and cultural differences speak to. So, it would be harder to instruct a machine, or to determine whether a machine or a machine-learning model is ethical or not.

 

Q: What do you think are good or bad examples of Data for Good?

A: There are many projects that use data to try to understand cancer, for instance, or try to help patients, treat patients, or try to understand how the climate actually changes. So, in terms of the first notion of Data for Good, there are a zillion examples, because every discipline has data and every discipline is using data to try to solve problems.

 

Q: To practice Tech for Good, what do you think a technology company should do?

A: I think the first thing is that you need to make sure your customers can trust you. A technology company like Tencent collects a lot of data about people and you also use that data to provide services to people. The customers need to know that they can trust you with their data.

When technology companies collect users’ data, they should take care of it and not share it without the users’ permission. Even if the laws of the place where a company operates allow it to share data with others, the company has to keep that in mind while following the rules, because in the end it’s about customer trust. The minute a technology company does something where a customer’s data is used in a bad way, or is revealed and embarrasses the customer, it’ll be bad PR for the company. Customers then may not want to use the company’s services, and that will be bad for its business.

So, it’s important for a technology company to ensure trust by being a good custodian of its customers’ data. It’s a contract between the company and its customers: customers come to expect that you will treat them and their data in a certain manner, and you can’t breach that contract.

If the technology company breaches that contract or wants to change its policy, it has to tell the customers. Suppose all of a sudden a company decides to share data with other companies or with the government; it’s best for the company to tell the customers. Customers have to know how you use their data.

As for the engineers at technology companies, because there are no automated tools to determine whether something is fair or ethical, the best thing they can do is to come up with a set of principles and then, with respect to those principles, a set of guidelines. The guidelines operationalize what the engineers have to think about when they’re collecting data, storing data, managing data, analyzing data, and then, in the end, producing outputs or services for the customers.

And internally, you can think of setting up processes like an internal review board. Suppose a technology company wants to put out a new app, and that app depends on collecting data from a certain population of people. The engineers should then have to pass a review board which makes sure that they collect and store the data properly and put all the security and privacy safeguards in place, so that only the people who need access to that data get access to it. Then, if they’re building a machine-learning model, they can test it for bias, because that would be in the guidelines.

So the first thing a tech company should do is to come up with a set of principles, just like Microsoft’s six principles. Maybe you copy Microsoft’s, or maybe you choose your own. Then, with respect to those principles, you have a set of guidelines for the engineers, and then a review board that can audit the procedures to make sure the engineers have followed the guidelines.

 

Q: To practice Tech for Good or Data for Good, what is the biggest obstacle for tech companies?

A: In terms of Tech for Good, I think the current biggest obstacle is that a lot of issues, like fairness, ethics, transparency, and so on, are issues that engineers are not trained to think about.

I’m a computer scientist. When I went to school, I never took an ethics course. Engineers typically don’t have to think about it. They have a problem, and they’re supposed to solve the problem. They build some technology. They solve the problem. They move on.

But now, because there’s such societal awareness of how this data could be used to do harm, I think the biggest obstacle is making sure that engineers are sensitive to these issues of fairness, ethics, and privacy. This is something they’ve not been trained to do.

Also, I have a different answer for what I think is the biggest technical challenge, one that we did not anticipate and for which I don’t have a good solution. That challenge is how easy it is to fake information.

Digital information is easy to change and disseminate. You change one bit and you’ve changed a file. You change a piece of news. You change an image. You change information, and anyone can do it. It’s so easy because digital information is just electrons, just bits. It’s not a physically manufactured object that you would have to hit with a hammer, or recast, to change.

Digital technology is very different from any other kind of technology, and what we in the digital technology community did not anticipate is how easy it is for bad people to take advantage of those features and disseminate false information.

It has gotten to the point where anyone can write anything, and therefore what you read on the internet is always suspect. You don’t know whether what you’re reading is actually true. You don’t know whether the information you’re getting is actually true, and this instills distrust in information across society.

Social institutions (such as the media), government agencies, banks, and technology companies are all at risk of this disruption. And this is where the trust relationship is so important. If customers trust a technology company, they’re going to believe everything on the company’s website, even though you and I know it’s very easy to corrupt it.

So, this to me is the hardest technological challenge we have created, and I don’t know of any technology that can solve this problem. This is because the attack is not on the data itself, but rather on our brains, our cognition. The people who are trying to fool us are really trying to manipulate our understanding of the world. And I don’t see an easy technical solution for that.

I think none of us realized how difficult a problem this is. How do you stop it? It’s difficult because when you squash it in one place, it pops up somewhere else. This problem is like the conflict around face recognition that we talked about earlier, and here is another example of a judgment call.

For example, a technology company running a news service may carry some news that could harm a population, or that could be misinterpreted, or that could incite a lot of violence. In that case, do you suppress it? Would doing so affect freedom of expression? That conflict is very difficult to resolve.

 

Q: What do you think of the different roles played by academia, industry, and government in the practice of Tech for Good?

A: I think that the government, academia, and industry all play a role in the advancement of science and in the technology innovation ecosystem. Academia’s role is to do long-term basic research, to pose problems that we may not encounter for another 10 or 20 years; that’s what long-term means. Academia’s role is also to train the next generation of students, the next generation of employees that industry will hire.

Industry’s role is to take ideas and to scale them up to provide products and services for people, society and consumers. Their role isn’t really to advance the state of the art, which is what research is. Their role is really to create products and services to benefit society and, of course, to make money.

Industry hires the people who graduate from academia, and the government’s role is to help fund basic research. In academia, one has to get money to do research in order to pay for graduate students, faculty, and laboratories. Usually, it’s the government that pays for that basic research.

So, it’s a virtuous cycle. In this ecosystem, each sector relies on and feeds the others. Academia produces trained talent who then go to work for industry; industry builds and sells products to consumers, and thus pays its employees. In academia, in order to do research, you need money, and that money usually comes from the government. And the government benefits because industry, while making money, also pays taxes. Those taxes, in turn, help fund the research that academia does.
