Date: 2023-09-29
Sam Altman, the OpenAI CEO, and an illustration of GPT-4.
JASON REDMOND/AFP via Getty Images; Jaap Arriens/NurPhoto via Getty Images
Hundreds of major companies and websites are now blocking ChatGPT’s web crawler.
Dozens more are also now blocking the crawler of Common Crawl, a major source of AI training data.
Unique, high-quality data, mainly scraped from the web, is vital to the performance of AI models.
More and more companies are trying to avoid having their data freely scraped and saved by web crawlers working for the benefit of AI models.
Last month, OpenAI revealed its own crawler, GPTBot, saying it would respect robots.txt, a decades-old method through which a website can tell a web crawler to ignore it. About 70 of the 1,000 most popular sites blocked it, including Amazon and Tumblr.
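For readers who want to see how such a block works in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The robots.txt contents, the `is_blocked` helper, and the example.com URL are illustrative assumptions, not taken from any site named in this article; only the crawler names GPTBot and CCBot come from the reporting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turns away GPTBot and CCBot, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt, user_agent, url="https://example.com/"):
    """Return True if these robots.txt rules tell `user_agent` not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(ROBOTS_TXT, "GPTBot"))        # True
print(is_blocked(ROBOTS_TXT, "CCBot"))         # True
print(is_blocked(ROBOTS_TXT, "SomeOtherBot"))  # False
```

Note that robots.txt is a request, not an enforcement mechanism: it only works when the crawler chooses to honor it, as OpenAI has said GPTBot will.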
This week, Insider got new data on this from Originality.ai. It shows that, over the course of about three weeks, the number of top sites blocking GPTBot has jumped to more than 250.
The list of new GPTBot blockers includes Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and what appear to be all titles published by Hearst and Conde Nast. Even weather.com is blocking the bot.
Unique and accurate information is vital to the performance of generative AI models like OpenAI’s GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, despite most of it being owned or under copyright. A growing awareness of the practice has led to several lawsuits, and new government rules and regulations could be on the way.
Many more companies are now also blocking CCBot, a web crawler used by Common Crawl. Based in Europe, Common Crawl has spent years collecting massive amounts of data from the web, including stuff under copyright, and organizing the datasets for use as free training data for large language models such as Meta’s Llama. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, according to data from Originality.ai.
Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic. Many of those blocking CCBot also block GPTBot. Still, ChatGPT’s notoriety seems to have caused more companies to block its crawler, even though CCBot has likely been active over a longer period of time.
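The overlap between the two groups can be checked with simple set operations. The sketch below uses only a handful of domains copied from the lists in this article, so it is an illustrative subset rather than the full dataset; the `overlap_report` helper is a hypothetical name introduced for the example.

```python
# Illustrative subset only: a few domains copied from this article's two lists.
GPTBOT_BLOCKERS = {
    "quora.com", "cnn.com", "nytimes.com",
    "shutterstock.com", "reuters.com", "tumblr.com",
}
CCBOT_BLOCKERS = {
    "nytimes.com", "shutterstock.com", "reuters.com",
    "tumblr.com", "pixabay.com", "depositphotos.com",
}


def overlap_report(gptbot, ccbot):
    """Split two sets of domains into shared and exclusive groups."""
    return {
        "both": sorted(gptbot & ccbot),
        "gptbot_only": sorted(gptbot - ccbot),
        "ccbot_only": sorted(ccbot - gptbot),
    }


report = overlap_report(GPTBOT_BLOCKERS, CCBOT_BLOCKERS)
for label, domains in report.items():
    print(f"{label}: {', '.join(domains)}")
```

Run against the complete lists, the same approach would show how many CCBot blockers also block GPTBot.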
While online businesses have been deploying robots.txt to try to stop their data from being taken to train AI models, many tech companies have updated their terms of service and user policies to give themselves free and full access to user content and activity for use in AI projects and training.
See below for a full list of the biggest websites now blocking GPTBot and CCBot as of Sept. 22:
Blocking GPTBot
amazon.com
quora.com
nytimes.com
theguardian.com
shutterstock.com
wikihow.com
cnn.com
sciencedirect.com
usatoday.com
healthline.com
stackexchange.com
alamy.com
scribd.com
webmd.com
businessinsider.com
dictionary.com
reuters.com
washingtonpost.com
medicalnewstoday.com
npr.org
cbsnews.com
goodhousekeeping.com
amazon.co.uk
tumblr.com
latimes.com
insider.com
glassdoor.com
vocabulary.com
investopedia.com
slideshare.net
amazon.de
cosmopolitan.com
nbcnews.com
indiamart.com
stackoverflow.com
hindustantimes.com
bloomberg.com
cnbc.com
people.com
tvtropes.org
amazon.in
vimeo.com
verywellhealth.com
ikea.com
espn.com
indianexpress.com
thesaurus.com
pbs.org
123rf.com
wattpad.com
variety.com
today.com
popsugar.com
thespruce.com
uol.com.br
amazon.fr
geeksforgeeks.org
elle.com
economictimes.com
pcmag.com
theverge.com
allrecipes.com
thoughtco.com
rollingstone.com
wired.com
nextdoor.com
hollywoodreporter.com
abc.net.au
ew.com
amazon.ca
news18.com
womenshealthmag.com
rateyourmusic.com
amazon.co.jp
techradar.com
airbnb.com
ndtv.com
lifewire.com
tomsguide.com
vulture.com
everydayhealth.com
polygon.com
theconversation.com
esquire.com
prnewswire.com
billboard.com
menshealth.com
metro.co.uk
countryliving.com
mashable.com
gamesradar.com
thehindu.com
timesofindia.com
deadline.com
harpersbazaar.com
medscape.com
nymag.com
refinery29.com
radiotimes.com
cbssports.com
tandfonline.com
theatlantic.com
trulia.com
amazon.es
pinterest.es
nationalgeographic.com
bhg.com
eater.com
southernliving.com
healthgrades.com
vice.com
picclick.com
bustle.com
newyorker.com
eonline.com
digitalspy.com
opentable.com
pinterest.de
thepioneerwoman.com
caranddriver.com
byrdie.com
livemint.com
medicinenet.com
teacherspayteachers.com
cookpad.com
thespruceeats.com
bizjournals.com
pagesjaunes.fr
liputan6.com
delish.com
masterclass.com
archiveofourown.org
vox.com
realsimple.com
aarp.org
francetvinfo.fr
pinterest.fr
kumparan.com
theathletic.com
travelandleisure.com
vogue.com
livescience.com
apartments.com
marketwatch.com
glamour.com
amazon.it
cinemablend.com
thrillist.com
amazon.com.br
pinterest.co.uk
angi.com
alamy.es
usmagazine.com
distractify.com
bbcgoodfood.com
jagran.com
mercadolibre.com.mx
androidauthority.com
city-data.com
foodandwine.com
hellomagazine.com
amazon.com.au
gq.com
ingles.com
amarujala.com
ieee.org
prevention.com
stern.de
kbb.com
edmunds.com
marthastewart.com
pcgamer.com
justanswer.com
health.com
20minutes.fr
fortune.com
homes.com
scientificamerican.com
popularmechanics.com
verywellfit.com
vanityfair.com
chicagotribune.com
verywellmind.com
housebeautiful.com
cntraveler.com
allure.com
spanishdict.com
neverbounce.com
answers.com
moneycontrol.com
architecturaldigest.com
slate.com
lonelyplanet.com
inverse.com
corriere.it
actu.fr
self.com
tripsavvy.com
instyle.com
eatingwell.com
superuser.com
welt.de
spiegel.de
womansday.com
seventeen.com
hbr.org
oprahdaily.com
autotrader.com
bonappetit.com
sueddeutsche.de
seriouseats.com
liveabout.com
seattletimes.com
coursera.org
livehindustan.com
france24.com
townandcountrymag.com
dotesports.com
worldplaces.me
faz.net
teenvogue.com
motor1.com
nj.com
glamourmagazine.co.uk
okdiario.com
brides.com
stylecaster.com
alamyimages.fr
jagranjosh.com
theglobeandmail.com
axios.com
francebleu.fr
tabelog.com
thebalancemoney.com
nydailynews.com
sheknows.com
naomedical.com
verywellfamily.com
Blocking CCBot
nytimes.com
shutterstock.com
reuters.com
goodhousekeeping.com
tumblr.com
cosmopolitan.com
pixabay.com
depositphotos.com
pbs.org
elle.com
glosbe.com
patch.com
wired.com
womenshealthmag.com
esquire.com
indiatoday.in
menshealth.com
countryliving.com
zippia.com
chron.com
harpersbazaar.com
tr-ex.me
detik.com
theatlantic.com
newyorker.com
digitalspy.com
etymonline.com
thepioneerwoman.com
caranddriver.com
hinative.com
teacherspayteachers.com
delish.com
masterclass.com
archiveofourown.org
theathletic.com
vogue.com
glamour.com
alltrails.com
gq.com
ingles.com
prevention.com
kbb.com
popularmechanics.com
vanityfair.com
housebeautiful.com
cntraveler.com
allure.com
spanishdict.com
architecturaldigest.com
self.com
sfgate.com
womansday.com
songkick.com
seventeen.com
oprahdaily.com
autotrader.com
bonappetit.com
aajtak.in
coursera.org
townandcountrymag.com
faz.net
teenvogue.com
glamourmagazine.co.uk