    OpenAI’s GPTBot and other AI web crawlers are being blocked by even more companies now

    Hundreds of major companies and websites are now blocking ChatGPT’s web crawler. Dozens more are also now blocking the crawler of Common Crawl, a major source of AI training data. Unique, high-quality data, mainly scraped from the web, is vital to the performance of AI models.

    More and more companies are trying to avoid having their data freely scraped and saved by web crawlers working for the benefit of AI models.

    Last month, OpenAI revealed its own crawler, GPTBot, saying it would respect robots.txt, a decades-old method through which a website can tell a web crawler to ignore it. About 70 of the 1,000 most popular sites initially blocked it, including Amazon and Tumblr.
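
    As a rough illustration of how a robots.txt block works, the sketch below uses Python's standard-library robots.txt parser to check which crawlers a site turns away. The robots.txt content, the example.com URL, and the blocked_agents helper are assumptions made up for this example, not the rules of any real site or the method Originality.ai used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site: it turns away GPTBot and
# CCBot entirely and allows every other crawler.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the user agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(EXAMPLE_ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"]))
```

    Note that robots.txt is a request rather than an enforcement mechanism: it only works if the crawler chooses to honor it, as OpenAI says GPTBot does.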

    This week, Insider got new data on this from Originality.ai. It shows that, over the course of about three weeks, the number of top sites blocking GPTBot has jumped to more than 250.

    The list of new GPTBot blockers includes Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and what appear to be all titles published by Hearst and by Condé Nast. Even weather.com is blocking the bot.

    Unique and accurate information is vital to the performance of generative AI models like OpenAI’s GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, despite most of it being owned or under copyright. A growing awareness of the practice has led to several lawsuits, and new government rules and regulations could be on the way.

    Many more companies are now also blocking CCBot, a web crawler used by Common Crawl. Based in Europe, Common Crawl has spent years collecting massive amounts of data from the web, including material under copyright, and organizing the datasets for use as free training data for large language models such as Meta’s Llama. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, according to data from Originality.ai.

    Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic. Many of those blocking CCBot also block GPTBot. It seems, though, that ChatGPT’s notoriety has caused more companies to block its crawler, even though CCBot has likely been active over a longer period of time.

    While online businesses have been deploying robots.txt to try to stop their data from being taken to train AI models, many tech companies have updated their terms of service and user policies to give themselves free and full access to user content and activity for use in AI projects and training.

    See below for a full list of the biggest websites now blocking GPTBot and CCBot as of Sept. 22:

    Blocking GPTBot

    amazon.com

    quora.com

    nytimes.com

    theguardian.com

    shutterstock.com

    wikihow.com

    cnn.com

    sciencedirect.com

    usatoday.com

    healthline.com

    stackexchange.com

    alamy.com

    scribd.com

    webmd.com

    businessinsider.com

    dictionary.com

    reuters.com

    washingtonpost.com

    medicalnewstoday.com

    npr.org

    cbsnews.com

    goodhousekeeping.com

    amazon.co.uk

    tumblr.com

    latimes.com

    insider.com

    glassdoor.com

    vocabulary.com

    investopedia.com

    slideshare.net

    amazon.de

    cosmopolitan.com

    nbcnews.com

    indiamart.com

    stackoverflow.com

    hindustantimes.com

    bloomberg.com

    cnbc.com

    people.com

    tvtropes.org

    amazon.in

    vimeo.com

    verywellhealth.com

    ikea.com

    espn.com

    indianexpress.com

    thesaurus.com

    pbs.org

    123rf.com

    wattpad.com

    variety.com

    today.com

    popsugar.com

    thespruce.com

    uol.com.br

    amazon.fr

    geeksforgeeks.org

    elle.com

    economictimes.com

    pcmag.com

    theverge.com

    allrecipes.com

    thoughtco.com

    rollingstone.com

    wired.com

    nextdoor.com

    hollywoodreporter.com

    abc.net.au

    ew.com

    amazon.ca

    news18.com

    womenshealthmag.com

    rateyourmusic.com

    amazon.co.jp

    techradar.com

    airbnb.com

    ndtv.com

    lifewire.com

    tomsguide.com

    vulture.com

    everydayhealth.com

    polygon.com

    theconversation.com

    esquire.com

    prnewswire.com

    billboard.com

    menshealth.com

    metro.co.uk

    countryliving.com

    mashable.com

    gamesradar.com

    thehindu.com

    timesofindia.com

    deadline.com

    harpersbazaar.com

    medscape.com

    nymag.com

    refinery29.com

    radiotimes.com

    cbssports.com

    tandfonline.com

    theatlantic.com

    trulia.com

    amazon.es

    pinterest.es

    nationalgeographic.com

    bhg.com

    eater.com

    southernliving.com

    healthgrades.com

    vice.com

    picclick.com

    bustle.com

    newyorker.com

    eonline.com

    digitalspy.com

    opentable.com

    pinterest.de

    thepioneerwoman.com

    caranddriver.com

    byrdie.com

    livemint.com

    medicinenet.com

    teacherspayteachers.com

    cookpad.com

    thespruceeats.com

    bizjournals.com

    pagesjaunes.fr

    liputan6.com

    delish.com

    masterclass.com

    archiveofourown.org

    vox.com

    realsimple.com

    aarp.org

    francetvinfo.fr

    pinterest.fr

    kumparan.com

    theathletic.com

    travelandleisure.com

    vogue.com

    livescience.com

    apartments.com

    marketwatch.com

    glamour.com

    amazon.it

    cinemablend.com

    thrillist.com

    amazon.com.br

    pinterest.co.uk

    angi.com

    alamy.es

    usmagazine.com

    distractify.com

    bbcgoodfood.com

    jagran.com

    mercadolibre.com.mx

    androidauthority.com

    city-data.com

    foodandwine.com

    hellomagazine.com

    amazon.com.au

    gq.com

    ingles.com

    amarujala.com

    ieee.org

    prevention.com

    stern.de

    kbb.com

    edmunds.com

    marthastewart.com

    pcgamer.com

    justanswer.com

    health.com

    20minutes.fr

    fortune.com

    homes.com

    scientificamerican.com

    popularmechanics.com

    verywellfit.com

    vanityfair.com

    chicagotribune.com

    verywellmind.com

    housebeautiful.com

    cntraveler.com

    allure.com

    spanishdict.com

    neverbounce.com

    answers.com

    moneycontrol.com

    architecturaldigest.com

    slate.com

    lonelyplanet.com

    inverse.com

    corriere.it

    actu.fr

    self.com

    tripsavvy.com

    instyle.com

    eatingwell.com

    superuser.com

    welt.de

    spiegel.de

    womansday.com

    seventeen.com

    hbr.org

    oprahdaily.com

    autotrader.com

    bonappetit.com

    sueddeutsche.de

    seriouseats.com

    liveabout.com

    seattletimes.com

    coursera.org

    livehindustan.com

    france24.com

    townandcountrymag.com

    dotesports.com

    worldplaces.me

    faz.net

    teenvogue.com

    motor1.com

    nj.com

    glamourmagazine.co.uk

    okdiario.com

    brides.com

    stylecaster.com

    alamyimages.fr

    jagranjosh.com

    theglobeandmail.com

    axios.com

    francebleu.fr

    tabelog.com

    thebalancemoney.com

    nydailynews.com

    sheknows.com

    naomedical.com

    verywellfamily.com

    Blocking CCBot

    nytimes.com

    shutterstock.com

    reuters.com

    goodhousekeeping.com

    tumblr.com

    cosmopolitan.com

    pixabay.com

    depositphotos.com

    pbs.org

    elle.com

    glosbe.com

    patch.com

    wired.com

    womenshealthmag.com

    esquire.com

    indiatoday.in

    menshealth.com

    countryliving.com

    zippia.com

    chron.com

    harpersbazaar.com

    tr-ex.me

    detik.com

    theatlantic.com

    newyorker.com

    digitalspy.com

    etymonline.com

    thepioneerwoman.com

    caranddriver.com

    hinative.com

    teacherspayteachers.com

    delish.com

    masterclass.com

    archiveofourown.org

    theathletic.com

    vogue.com

    glamour.com

    alltrails.com

    gq.com

    ingles.com

    prevention.com

    kbb.com

    popularmechanics.com

    vanityfair.com

    housebeautiful.com

    cntraveler.com

    allure.com

    spanishdict.com

    architecturaldigest.com

    self.com

    sfgate.com

    womansday.com

    songkick.com

    seventeen.com

    oprahdaily.com

    autotrader.com

    bonappetit.com

    aajtak.in

    coursera.org

    townandcountrymag.com

    faz.net

    teenvogue.com

    glamourmagazine.co.uk

