Fri. Nov 15th, 2024

OpenAI’s GPTBot and other AI web crawlers are being blocked by even more companies now

Sam Altman, the OpenAI CEO, and an illustration of GPT-4.

Hundreds of major companies and websites are now blocking ChatGPT’s web crawler. Dozens more are also now blocking the crawler of Common Crawl, a major source of AI training data. Unique, high-quality data, mainly scraped from the web, is vital to the performance of AI models.

More and more companies are trying to avoid having their data freely scraped and saved by web crawlers working for the benefit of AI models.

Last month, OpenAI revealed its own crawler, GPTBot, saying it would respect robots.txt, a decades-old method through which a website can tell a web crawler to ignore it. About 70 of the 1,000 most popular sites blocked it, including Amazon and Tumblr.
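As a rough illustration of how a robots.txt block works, the sketch below uses Python's standard `urllib.robotparser` module to check which crawlers a robots.txt file turns away. The file contents, the `example.com` URL, and the `blocked_agents` helper are hypothetical and are not taken from any site named in this article; only the `GPTBot` and `CCBot` user-agent names come from the text.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site (not a real site's file).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True if robots_txt disallows that agent from fetching url}."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "CCBot", "Googlebot"])
for agent, is_blocked in result.items():
    print(f"{agent}: {'blocked' if is_blocked else 'allowed'}")
```

Note that robots.txt is only a request. It works when a crawler chooses to honor it, as OpenAI said GPTBot would.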

This week, Insider got new data on this from Originality.ai. It shows that, over the course of about three weeks, the number of top sites blocking GPTBot has jumped to more than 250.

The list of new GPTBot blockers includes Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and what appear to be all titles published by Hearst and Condé Nast. Even weather.com is blocking the bot.

Unique and accurate information is vital to the performance of generative AI models like OpenAI’s GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, despite most of it being owned or under copyright. A growing awareness of the practice has led to several lawsuits, and new government rules and regulations could be on the way.

Many more companies are now also blocking CCBot, a web crawler used by Common Crawl. Based in Europe, Common Crawl has spent years collecting massive amounts of data from the web, including material under copyright, and organizing the datasets for use as free training data for large language models such as Meta’s Llama. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, according to data from Originality.ai.

Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic. Many of those blocking CCBot also block GPTBot. Still, ChatGPT’s notoriety seems to have led more companies to block its crawler, even though CCBot has likely been active for longer.

While online businesses have been deploying robots.txt to try to stop their data from being taken to train AI models, many tech companies have updated their terms of service and user policies to give themselves free and full access to user content and activity for use in AI projects and training.

See below for a full list of the biggest websites now blocking GPTBot and CCBot as of Sept. 22:

Blocking GPTBot

amazon.com

quora.com

nytimes.com

theguardian.com

shutterstock.com

wikihow.com

cnn.com

sciencedirect.com

usatoday.com

healthline.com

stackexchange.com

alamy.com

scribd.com

webmd.com

businessinsider.com

dictionary.com

reuters.com

washingtonpost.com

medicalnewstoday.com

npr.org

cbsnews.com

goodhousekeeping.com

amazon.co.uk

tumblr.com

latimes.com

insider.com

glassdoor.com

vocabulary.com

investopedia.com

slideshare.net

amazon.de

cosmopolitan.com

nbcnews.com

indiamart.com

stackoverflow.com

hindustantimes.com

bloomberg.com

cnbc.com

people.com

tvtropes.org

amazon.in

vimeo.com

verywellhealth.com

ikea.com

espn.com

indianexpress.com

thesaurus.com

pbs.org

123rf.com

wattpad.com

variety.com

today.com

popsugar.com

thespruce.com

uol.com.br

amazon.fr

geeksforgeeks.org

elle.com

economictimes.com

pcmag.com

theverge.com

allrecipes.com

thoughtco.com

rollingstone.com

wired.com

nextdoor.com

hollywoodreporter.com

abc.net.au

ew.com

amazon.ca

news18.com

womenshealthmag.com

rateyourmusic.com

amazon.co.jp

techradar.com

airbnb.com

ndtv.com

lifewire.com

tomsguide.com

vulture.com

everydayhealth.com

polygon.com

theconversation.com

esquire.com

prnewswire.com

billboard.com

menshealth.com

metro.co.uk

countryliving.com

mashable.com

gamesradar.com

thehindu.com

timesofindia.com

deadline.com

harpersbazaar.com

medscape.com

nymag.com

refinery29.com

radiotimes.com

cbssports.com

tandfonline.com

theatlantic.com

trulia.com

amazon.es

pinterest.es

nationalgeographic.com

bhg.com

eater.com

southernliving.com

healthgrades.com

vice.com

picclick.com

bustle.com

newyorker.com

eonline.com

digitalspy.com

opentable.com

pinterest.de

thepioneerwoman.com

caranddriver.com

byrdie.com

livemint.com

medicinenet.com

teacherspayteachers.com

cookpad.com

thespruceeats.com

bizjournals.com

pagesjaunes.fr

liputan6.com

delish.com

masterclass.com

archiveofourown.org

vox.com

realsimple.com

aarp.org

francetvinfo.fr

pinterest.fr

kumparan.com

theathletic.com

travelandleisure.com

vogue.com

livescience.com

apartments.com

marketwatch.com

glamour.com

amazon.it

cinemablend.com

thrillist.com

amazon.com.br

pinterest.co.uk

angi.com

alamy.es

usmagazine.com

distractify.com

bbcgoodfood.com

jagran.com

mercadolibre.com.mx

androidauthority.com

city-data.com

foodandwine.com

hellomagazine.com

amazon.com.au

gq.com

ingles.com

amarujala.com

ieee.org

prevention.com

stern.de

kbb.com

edmunds.com

marthastewart.com

pcgamer.com

justanswer.com

health.com

20minutes.fr

fortune.com

homes.com

scientificamerican.com

popularmechanics.com

verywellfit.com

vanityfair.com

chicagotribune.com

verywellmind.com

housebeautiful.com

cntraveler.com

allure.com

spanishdict.com

neverbounce.com

answers.com

moneycontrol.com

architecturaldigest.com

slate.com

lonelyplanet.com

inverse.com

corriere.it

actu.fr

self.com

tripsavvy.com

instyle.com

eatingwell.com

superuser.com

welt.de

spiegel.de

womansday.com

seventeen.com

hbr.org

oprahdaily.com

autotrader.com

bonappetit.com

sueddeutsche.de

seriouseats.com

liveabout.com

seattletimes.com

coursera.org

livehindustan.com

france24.com

townandcountrymag.com

dotesports.com

worldplaces.me

faz.net

teenvogue.com

motor1.com

nj.com

glamourmagazine.co.uk

okdiario.com

brides.com

stylecaster.com

alamyimages.fr

jagranjosh.com

theglobeandmail.com

axios.com

francebleu.fr

tabelog.com

thebalancemoney.com

nydailynews.com

sheknows.com

naomedical.com

verywellfamily.com

Blocking CCBot

nytimes.com

shutterstock.com

reuters.com

goodhousekeeping.com

tumblr.com

cosmopolitan.com

pixabay.com

depositphotos.com

pbs.org

elle.com

glosbe.com

patch.com

wired.com

womenshealthmag.com

esquire.com

indiatoday.in

menshealth.com

countryliving.com

zippia.com

chron.com

harpersbazaar.com

tr-ex.me

detik.com

theatlantic.com

newyorker.com

digitalspy.com

etymonline.com

thepioneerwoman.com

caranddriver.com

hinative.com

teacherspayteachers.com

delish.com

masterclass.com

archiveofourown.org

theathletic.com

vogue.com

glamour.com

alltrails.com

gq.com

ingles.com

prevention.com

kbb.com

popularmechanics.com

vanityfair.com

housebeautiful.com

cntraveler.com

allure.com

spanishdict.com

architecturaldigest.com

self.com

sfgate.com

womansday.com

songkick.com

seventeen.com

oprahdaily.com

autotrader.com

bonappetit.com

aajtak.in

coursera.org

townandcountrymag.com

faz.net

teenvogue.com

glamourmagazine.co.uk

Read the original article on Business Insider
