UCD researcher discovers racist and sexist terms in MIT AI training data

  • PhD candidate Abeba Birhane from University College Dublin has helped discover several ethical issues in an MIT AI training dataset, including racist and misogynistic slurs.

    Artificial intelligence systems can be trained to recognise a huge variety of objects in images, but an AI is only a product of the data it's trained on. In recent years there has been a big push to increase the volume of data used to train AI, often with little regard for data quality, and huge models such as the GPT-2 text-synthesis AI have produced some very compelling results from this approach.

    A massive library of over 80 million images was compiled by the Massachusetts Institute of Technology in the US back in 2006 with the goal of helping to train AI. The Tiny Images dataset started with a list of 53,000 nouns drawn from the WordNet lexical database, then downloaded images relating to those nouns from search engines, yielding a dataset of 80 million thumbnails.
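    The collection process described above can be sketched in a few lines of Python. This is purely illustrative and not the actual MIT pipeline: the function names and placeholder URLs are assumptions, the search-engine query is stubbed out, and the thumbnail step downscales a toy grayscale image to the 32 x 32 size mentioned later in this article.

    ```python
    # Illustrative sketch of a Tiny Images-style pipeline (hypothetical names;
    # the real MIT collection code is not reproduced here).

    def fetch_image_urls(noun, limit=5):
        """Stand-in for an image-search query. A real pipeline would call a
        search engine API here; this stub returns placeholder URLs."""
        return [f"https://example.com/{noun}/{i}.jpg" for i in range(limit)]

    def downscale_to_thumbnail(pixels, size=32):
        """Downscale a square grayscale image (a list of rows of ints) to
        size x size by averaging non-overlapping pixel blocks, mirroring
        the 32 x 32 thumbnails the dataset stored."""
        n = len(pixels)
        block = n // size
        thumb = []
        for r in range(size):
            row = []
            for c in range(size):
                vals = [pixels[r * block + i][c * block + j]
                        for i in range(block) for j in range(block)]
                row.append(sum(vals) // len(vals))
            thumb.append(row)
        return thumb

    if __name__ == "__main__":
        urls = fetch_image_urls("maple")          # one noun from the list
        image = [[100] * 64 for _ in range(64)]   # stand-in 64x64 "download"
        thumb = downscale_to_thumbnail(image)
        print(len(urls), len(thumb), len(thumb[0]))  # 5 32 32
    ```

    The key point the sketch makes concrete: the labels attached to each thumbnail are simply the nouns used as search queries, so whatever is in that noun list ends up in the dataset's labels.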

    The problem with training AI like this is that it bakes existing societal biases into the model, and we've already seen AI built for tasks such as recruitment turn out to be systematically racist and sexist. As part of a renewed push for better-quality training data, UCD's Abeba Birhane analysed the Tiny Images dataset and discovered a number of problems.

    She found that the list of 53,000 nouns contained several racial slurs and misogynistic terms, which could cause an AI trained on the images to learn those terms and how we use them. Because the images were gathered from search results that reflect what we already associate with each word, any model trained on the data would automatically absorb our human biases and stereotypes.

    Other problems discovered in the training data included egregious privacy violations, with images of people's children and more scraped from search engines without consent. Images on Google Image Search and other search engines are publicly accessible, but they aren't in the public domain and can't be used without the consent of the original creator.

    In response to the paper, MIT has withdrawn the Tiny Images dataset and asked AI developers not to use it in their projects. A spokesperson also explained that the images were only 32 x 32 pixels in size, making their contents difficult for people to recognise. Birhane is continuing to fight for ethics in AI development and has urged AI developers to consider the impact of their training data on vulnerable people.

    Source: Silicon Republic
    Image: (c) Abeba Birhane

    About the author

    Brendan is a Sync NI writer with a special interest in the gaming sector, programming, emerging technology, and physics. To connect with Brendan, feel free to send him an email or follow him on Twitter.
