How to Block ChatGPT From Stealing Content From Your Site
Don't let your site become "training data"
Large Language Models like ChatGPT pull their training data from a variety of sources, including books, emails, and Wikipedia.
But the biggest set of training data (and the most relevant to site owners) comes from a dataset generated by a bot that scrapes sites across the entire Internet.
You have to block the bot if you want to prevent ChatGPT from stealing content from your site. Here’s how.
Blocking the Common Crawl Bot
According to a research paper, GPT-3 and GPT-3.5 used five different datasets for their training data.
The biggest source (weighted at 60%) is a dataset called “Common Crawl.”
According to their homepage, Common Crawl “builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.”
Common Crawl uses a bot, called CCBot, that functions similarly to the Googlebot you’re already familiar with. It follows links around the internet, hoovering up any new pages it comes across into an extensive database.
It then makes this database available to “internet researchers, companies and individuals at no cost for the purpose of research and analysis”, including OpenAI.
So what do you do if you want to prevent Common Crawl from making a copy of your website and letting it be used as training data for ChatGPT?
Fortunately, there’s an easy way to block the Common Crawl Bot by adding a few lines to your robots.txt file:
User-agent: CCBot
Disallow: /
That’s it.
The next time the Common Crawl Bot comes to your website and checks your robots.txt file, it will stop crawling your content.
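If you want to sanity-check the change, Python’s standard library can parse a live robots.txt the same way a well-behaved crawler does. This is just a quick sketch; example.com is a placeholder for your own domain.

from urllib.robotparser import RobotFileParser

# Point the parser at your live robots.txt (swap in your own domain)
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the file

# can_fetch() answers "may this user agent crawl this URL?"
print(rp.can_fetch("CCBot", "https://example.com/"))      # expect False after the change
print(rp.can_fetch("Googlebot", "https://example.com/"))  # normal crawlers stay unaffected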
Common Crawl even provides these same instructions on their website.
If you’re new to this and don’t know what a robots.txt file is, the easiest way to create and edit one on WordPress is with the Yoast SEO plugin.
Once you install the plugin:
Click on “Yoast SEO” in the menu
Click on “Tools”
Click on “File Editor”
Click on the “Create robots.txt file” button
Copy/paste the above lines into the file
Click on “Save changes to robots.txt”
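Once you’ve saved, it’s worth confirming the file is actually being served at yoursite.com/robots.txt. Here’s a minimal check using only Python’s standard library; again, example.com stands in for your domain.

import urllib.request

# Fetch the live robots.txt so you can eyeball it
with urllib.request.urlopen("https://example.com/robots.txt") as resp:
    body = resp.read().decode("utf-8")

print(body)
# Rough substring check only: confirms the CCBot block made it into the file
print("CCBot rule found:", "User-agent: CCBot" in body)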
Unfortunately there’s no way to opt your previously-crawled content out of OpenAI’s database. You should assume that ChatGPT has access to everything you published before blocking CCBot.
But this will prevent your website from being included in future datasets and protect any new content that you publish.
OpenWebText2
There’s one more way that scraped web content ends up in ChatGPT’s training data: the WebText2 dataset (weighted at 22% in the training mix). OpenWebText2 is an open-source recreation of that dataset.
WebText2 consists of pages whose URLs have been linked on Reddit with at least 3 upvotes. The idea here is that if a link has earned at least 3 upvotes, the page is most likely high quality enough to be used as training data.
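To make the selection criterion concrete, here’s a toy sketch in Python. This is an illustration of the filtering idea, not OpenAI’s actual pipeline, and the URLs and scores below are made up.

# Hypothetical (url, upvotes) pairs; the threshold mirrors the >=3 rule above
reddit_links = [
    ("https://example.com/great-guide", 57),
    ("https://example.com/thin-page", 1),
    ("https://example.com/solid-post", 3),
]

MIN_UPVOTES = 3
candidates = [url for url, score in reddit_links if score >= MIN_UPVOTES]
print(candidates)  # ['https://example.com/great-guide', 'https://example.com/solid-post']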
Unfortunately, I couldn’t find a way to block WebText2 (which was created by OpenAI) from using your website’s content.
The good news is that you most likely have very few URLs on your site that have been linked on Reddit with at least 3 upvotes. Even for huge publishers, this is likely to cover only a tiny percentage of pages.
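If you want to estimate your own exposure, Reddit’s public info endpoint can list submissions for a given URL. The sketch below assumes that endpoint behaves as it historically has; check Reddit’s current API terms before relying on it, and replace the example URL with one of your own pages.

import json
import urllib.parse
import urllib.request

page = "https://example.com/some-post"  # replace with a URL from your site
req = urllib.request.Request(
    "https://www.reddit.com/api/info.json?url=" + urllib.parse.quote(page, safe=""),
    headers={"User-Agent": "robots-audit/0.1"},  # Reddit rejects requests without a user agent
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Count submissions that clear the 3-upvote bar discussed above
hits = [c["data"] for c in data["data"]["children"] if c["data"].get("score", 0) >= 3]
print(len(hits), "Reddit submissions with >=3 upvotes link to", page)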
If you want to examine the data for yourself, you can download the public OpenWebText2 dataset here: 17,103,059 documents in the “Plug and Play” version and 69,547,149 documents in the “Raw Scrapes” version.
Conclusion
ChatGPT uses scraped content from websites as a MAJOR source of training data.
The majority of OpenAI’s scraped website content comes from the Common Crawl dataset, which you can easily “opt out” of by adding a few lines to your robots.txt file on your website.
The process takes less than 2 minutes and prevents Common Crawl from making a copy of your website that can be used in future AI training datasets.
If you don’t mind the fact that ChatGPT is scraping your site, then you don’t have to do anything.
But if it bothers you that they’re building their business off content you put time and money into, without compensating you or asking for your permission, then blocking the Common Crawl Bot is a no-brainer.
It should be up to site owners to decide for themselves how their content is used.
Ser,
If you already have stuff in your robots.txt file, this would go at the bottom, correct?
Does this look OK?
Sitemap: https://ser.com/sitemap.xml
User-agent: *
Disallow:
User-agent: CCBot
Disallow: /
As Bing and other tools start relying on ChatGPT, is there a potential negative SEO impact to doing this?