What Is Photo Scraping?
Photo scraping is the automated downloading of large numbers of images from the internet. Bots, or 'scrapers,' crawl websites, social media platforms, and public forums to collect visual data. Historically, this was done for purposes such as search engine indexing. The rise of generative artificial intelligence (AI) has changed the stakes: massive image datasets are now essential for 'training' AI models such as DALL-E, Midjourney, and Meta’s generative AI models. These models learn patterns, styles, and objects from the photos they are trained on, which enables them to generate entirely new images. The core issue for many users is that their personal photos (of their faces, families, and artwork) are being used to build commercial products without their knowledge or permission.
Why You Should Care Now
The rapid advancement of AI has created an insatiable appetite for data. Tech companies argue that using publicly available information is fair game. However, privacy advocates and creators raise valid concerns. Your likeness could be used to create deepfakes, your artistic style could be mimicked by an AI, or your personal moments could become part of a dataset sold to third parties. While it’s nearly impossible to completely stop all forms of scraping, major platforms are beginning to introduce tools that give users a semblance of control. Understanding and using these settings is your first line of defence in this new digital landscape.
Update Your Settings on Meta (Facebook & Instagram)
Meta has been one of the most prominent companies using public user data to train its generative AI models. Fortunately, they provide an opt-out mechanism, though it requires you to be proactive. To object to your data being used, you need to fill out a specific form titled 'Generative AI Data Subject Rights'.

1. **Find the Form:** Search for “Meta Privacy Centre” and navigate to the section on generative AI. You should find a link to the data rights form. It's often buried in the privacy policy details.
2. **State Your Case:** The form will ask for your country of residence, email address, and a field to explain how this processing affects you. You can state your objection clearly and simply. For example: “I do not consent to my personal information, including my photos and likeness, being used for training generative AI models due to privacy concerns.”
3. **Submit and Wait:** After submitting, you should receive a confirmation email. Meta says it will honour these requests going forward. It's important to note this may not apply to data already scraped.
Managing Your Content on Google
Google’s approach is slightly different. The company states it does not use content from Google Photos, Drive, or Gmail to train its generative AI models like Gemini. The primary concern is with content you make public elsewhere, such as on public blogs or websites indexed by Google Search.

Google introduced a control that lets website owners opt out of having their content used for training. If you own a website, you can add a simple instruction to your site’s `robots.txt` file, which tells web crawlers what they can and cannot access. The control is a `robots.txt` token called Google-Extended. It is not a separate bot: Google’s existing crawlers still fetch the pages, and the token signals whether that content may be used for AI training. To opt out your whole site, add the following lines to your `robots.txt`:

`User-agent: Google-Extended`
`Disallow: /`

This is a technical step, but it’s the most direct way to signal your preference to Google. For regular users, the best defence is limiting what you post on public websites.
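If you want to sanity-check the rules before uploading them, the sketch below uses Python's standard-library `urllib.robotparser` to see how a `robots.txt` containing the two lines above is interpreted. It rests on a few assumptions: the `User-agent: *` group and the `example.com` URL are added purely for illustration, and the helper name `is_allowed` is hypothetical. It shows how Python's parser reads the file, not how Google processes it, and `robots.txt` remains a request that crawlers choose to honour.

```python
from urllib.robotparser import RobotFileParser

# The two Google-Extended lines from the article, followed by an
# illustrative catch-all group (an assumption, not part of the article).
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    photo = "https://example.com/photos/cat.jpg"  # hypothetical URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, photo))
```

With these rules the Google-Extended token is disallowed everywhere, while ordinary search crawling is unaffected, so opting out of training this way should not remove your pages from Google Search.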
General Best Practices for All Platforms
Beyond platform-specific settings, you can adopt several habits to better protect your images online:

* **Review Your Audience Settings:** On Instagram, Facebook, and X (formerly Twitter), regularly check who can see your posts. Setting your profile to 'Private' is the single most effective way to prevent public scraping.
* **Be Mindful of Public Forums:** Images posted on sites like Reddit or public forums are easily scraped. Think twice before sharing personal photos in these spaces.
* **Consider Watermarking:** If you are a creator or artist, applying a visible watermark to your images can deter casual scraping and may make it harder for AI to learn your specific style without attribution.
* **Use 'noimageindex' Tags:** If you run your own website, you can tell Google not to index the images on a specific page by adding a `<meta name="robots" content="noimageindex">` tag to the page’s HTML. This prevents them from appearing in Google Image Search, reducing their visibility to scrapers.
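
For site owners, one way to confirm that a page actually carries the `noimageindex` rule is to scan its HTML for a robots meta tag of the form `<meta name="robots" content="noimageindex">`. The following is a minimal sketch using only Python's standard-library `html.parser`. The class and function names are hypothetical, and treating `googlebot` as an accepted meta name alongside `robots` is an assumption of this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the rules listed in robots-style <meta> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() not in ("robots", "googlebot"):
            return
        # The content attribute is a comma-separated list of rules.
        for rule in attr.get("content", "").split(","):
            rule = rule.strip().lower()
            if rule:
                self.directives.add(rule)


def has_noimageindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noimageindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noimageindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noimageindex"></head></html>'
    print(has_noimageindex(page))
```

This only inspects the HTML you pass in. It does not fetch pages or check HTTP headers, so you would paste in or load your own page source to use it.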