Understanding the Threat of Scraping
Before you can build a defence, you need to understand the attack. 'Image scraping' is the automated process in which bots, often operated by AI companies, crawl the internet and download massive quantities of images. These images, whether from your personal blog, art portfolio, or social media, become the raw material for training generative AI models like DALL-E, Midjourney, and Stable Diffusion. Without your consent, your creative work or personal photos could be used to teach an algorithm how to replicate styles, generate new images, or even create deepfakes. This may infringe your copyright, and it can also be a serious privacy violation. Protecting your work is no longer just about preventing theft; it's about controlling your digital identity and creative output in the age of AI.
The First Line: Website-Level Controls
If you host your own website, portfolio, or blog, your first line of defence lies in communicating your wishes directly to these bots. You can do this in two primary ways. First, update your `robots.txt` file. This is a simple text file in your site's root directory that gives instructions to web crawlers. By adding the directive `User-agent: CCBot` followed on the next line by `Disallow: /`, you can block a specific known scraper; repeat the pair for each crawler you want to exclude. Second, use HTML meta tags. A newer, informal convention (not a formal standard) is the `noimageai` directive, which can be placed in the `<head>` section of your web pages, for example as `<meta name="robots" content="noai, noimageai">`. This tag signals to compliant AI crawlers that the images on the page should not be used for training purposes. Both mechanisms are voluntary, so not all bots will respect them, but it’s a crucial first step that establishes your intent and keeps out crawlers that do honour such requests.
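As a rough illustration of the `robots.txt` approach, the Python sketch below builds a file with one blocking rule per crawler and then checks it using the standard library's `urllib.robotparser`. Only `CCBot` comes from the text above; `ExampleAIBot`, the helper names, and the `example.com` URL are hypothetical placeholders, so substitute the user agents you actually want to exclude.

```python
from urllib.robotparser import RobotFileParser

# CCBot is mentioned above; ExampleAIBot is a hypothetical placeholder name.
AI_CRAWLERS = ["CCBot", "ExampleAIBot"]


def build_robots_txt(agents):
    """Return robots.txt content that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # Blank lines separate the per-crawler rule groups.
    return "\n\n".join(blocks) + "\n"


def is_allowed(robots_txt, agent, url):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(AI_CRAWLERS)
    print(robots)
    image_url = "https://example.com/art/piece.png"  # hypothetical URL
    print("CCBot allowed:", is_allowed(robots, "CCBot", image_url))
    print("SomeOtherBot allowed:", is_allowed(robots, "SomeOtherBot", image_url))
```

The check only shows how a well-behaved crawler would interpret the file. A scraper that ignores `robots.txt` is unaffected, which is why this remains a statement of intent rather than an enforcement mechanism.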
Active Disruption: Tools like Glaze and Nightshade
For artists and creators, passive requests may not be enough. This is where active disruption tools come in. Developed by researchers at the University of Chicago, Glaze and Nightshade offer a more aggressive way to protect your art. Glaze works by adding a very subtle layer of 'noise', or 'style cloak', to your images before you upload them. To the human eye, the image looks almost unchanged. But to an AI model, the cloak misrepresents the artistic style, making it difficult for the model to learn from and replicate your work. Nightshade is even more potent; it acts as a 'data poisoning' tool. It manipulates pixels in your art in a way that causes a model trained on it to learn incorrect associations, corrupting its ability to understand concepts. For example, a 'poisoned' image of a dog might teach the model that dogs look like cats, subtly sabotaging the dataset.
Rethinking Watermarks and Metadata
Traditional watermarks have often been seen as easy to remove, but their role is evolving. Instead of a simple logo in the corner, consider more integrated, less obtrusive watermarks that are harder for automated tools to crop out. More importantly, embedding copyright information directly into the image's metadata (for example, its EXIF, IPTC, or XMP fields) is a vital practice. This includes your name, contact information, and a clear copyright statement. Scrapers often strip this data, so it will not stop a download, but having it embedded in your original files provides a useful piece of evidence if you ever need to prove ownership. It’s a layer of deterrence and documentation that, when combined with other methods, strengthens your overall security posture.
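To make the idea of embedded metadata concrete, here is a minimal Python sketch that uses only the standard library. Writing EXIF fields into a JPEG generally needs a third-party library, so this example swaps in a simpler relative of the same technique: the `tEXt` chunks of the PNG format, which has predefined `Author` and `Copyright` keywords. The function names, the sample values, and the 1x1 test image are all assumptions made for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _chunk(chunk_type, data):
    """Serialise one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunks(png_bytes, fields):
    """Return a copy of the PNG with one tEXt chunk per field, placed after IHDR."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk: 4 (length) + 4 (type) + 13 (data) + 4 (CRC).
    ihdr_end = len(PNG_SIGNATURE) + 25
    text = b"".join(
        _chunk(b"tEXt", key.encode("latin-1") + b"\x00" + value.encode("latin-1"))
        for key, value in fields.items()
    )
    return png_bytes[:ihdr_end] + text + png_bytes[ihdr_end:]


def read_text_chunks(png_bytes):
    """Walk the chunk list and collect every tEXt keyword/value pair."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        length, chunk_type = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # length field + type + data + CRC
    return found


def tiny_png():
    """Build a valid 1x1 greyscale PNG, purely as demonstration input."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return PNG_SIGNATURE + _chunk(b"IHDR", ihdr) + _chunk(b"IDAT", idat) + _chunk(b"IEND", b"")


if __name__ == "__main__":
    # Hypothetical owner details; replace with your own.
    tagged = add_text_chunks(tiny_png(), {
        "Author": "Jane Example",
        "Copyright": "(c) 2024 Jane Example. All rights reserved.",
    })
    print(read_text_chunks(tagged))
```

In practice you would read your real image with `open(path, "rb")` and write the tagged bytes to a new file. For JPEG files, a dedicated metadata tool or library is the more usual route.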
Reviewing Your Platform's Terms of Service
Most people don't host their own websites; they use social media and portfolio platforms like Instagram, Behance, or Flickr. It is absolutely critical to read the Terms of Service for any platform where you upload your images. Many of these services have clauses that grant them a broad license to use, modify, and even sublicense your content. Furthermore, their privacy settings may determine whether your content is exposed to third-party crawlers. Take the time to navigate the privacy and content settings on each platform. Opt for the most restrictive settings available, such as making your profile private or limiting access to specific individuals. While this may reduce public visibility, it significantly enhances your control over how your images are used.