Your Sitecore Site is Being Farmed to Train AI

> Do you have a plan?

Overview

Sitecore is used by many large and reputable organizations: banks, insurance companies, government agencies, not-for-profits, law firms, etc. These are some of the most reputable and trusted sites on the internet, and this makes them a prime target for web scraping and data mining for training AIs. They are large, complex, and have a lot of data. They are also often not as well protected as they should be. Many Sitecore sites have effectively gifted all of their data to AI companies.

And now:

Chamber of Progress, a tech-industry lobbying group whose members include Apple, Meta and Amazon, has launched a campaign to defend the practice of using copyrighted works to train AI.

This emerging paradigm calls many existing tools, processes, jobs, and companies into question, and renders many obsolete. This also creates endless new opportunities for those who can adapt.

In this post, I will discuss some of the implications of this new paradigm for implementation partners and their clients.

The Discovery

This hit close to home when I was looking through some IIS logs and found Anthropic's "ClaudeBot" making over 325,000 requests to a Sitecore site in a single day. They weren't even subtle about it; they were crawling the site at around 10 requests per second for more than 8 hours straight, and the requests seemed to be focused on content that was uniquely dense and valuable. This was not a simple "crawl the site and move on" operation; this was a targeted data mining operation.

I looked further back in time and saw that the requests started in November of 2023 and have been happening consistently and more comprehensively ever since.

Anthropic isn't the only one. OpenAI's crawler has been crawling the site as well, and started doing so around the same time as Anthropic.
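
If you want to run the same kind of check on your own logs, here is a minimal Python sketch. It assumes the standard IIS W3C extended log format, where a `#Fields:` header names the columns and spaces inside the user agent are written as `+`. The sample log lines, the IP addresses, and the `GPTBot` marker are my own illustrative assumptions and are not taken from the logs described above; adjust the markers to whatever actually appears in your logs.

```python
from collections import Counter
from typing import Iterable

# Substrings to look for in the user-agent column. "ClaudeBot" is the name
# quoted in this post; "GPTBot" is an ASSUMPTION about which OpenAI crawler
# you might see. Edit this tuple to match your own logs.
CRAWLER_MARKERS = ("ClaudeBot", "GPTBot")

# Made-up sample lines for illustration only (not real log data).
SAMPLE_LOG = """\
#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status
2024-05-01 00:00:01 GET /articles/a 203.0.113.7 Mozilla/5.0+(compatible;+ClaudeBot/1.0) 200
2024-05-01 00:00:01 GET /articles/b 203.0.113.7 Mozilla/5.0+(compatible;+ClaudeBot/1.0) 200
2024-05-01 00:00:02 GET / 198.51.100.4 Mozilla/5.0+(Windows+NT+10.0) 200
""".splitlines()


def count_crawler_requests(lines: Iterable[str]) -> Counter:
    """Count requests per crawler marker in IIS W3C extended log lines."""
    counts: Counter = Counter()
    ua_index = None  # column of cs(User-Agent), learned from the #Fields header
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            ua_index = (
                fields.index("cs(User-Agent)")
                if "cs(User-Agent)" in fields
                else None
            )
            continue
        if not line or line.startswith("#") or ua_index is None:
            continue  # skip comments, blanks, and lines before any header
        parts = line.split()
        if ua_index >= len(parts):
            continue  # malformed or truncated line
        user_agent = parts[ua_index].lower()
        for marker in CRAWLER_MARKERS:
            if marker.lower() in user_agent:
                counts[marker] += 1
                break
    return counts


def hours_at_rate(requests: int, requests_per_second: float) -> float:
    """How many hours a given request count takes at a steady rate."""
    return requests / requests_per_second / 3600


if __name__ == "__main__":
    print(count_crawler_requests(SAMPLE_LOG))
    # Sanity check on the figures above: 325,000 requests at 10 per second.
    print(round(hours_at_rate(325_000, 10), 2), "hours")
```

The `hours_at_rate` helper is just a sanity check on the numbers: 325,000 requests at a steady 10 per second works out to roughly 9 hours, which lines up with "more than 8 hours straight". Keep in mind that matching on user agent only catches crawlers that identify themselves honestly.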

I started thinking about the implications of this, specifically with regard to all of my Sitecore clients. As Sitecore implementation partners, we are uniquely positioned to understand how our clients' business models might need serious re-evaluation. We have a duty to take the lead in guiding our clients through this new reality. Their survival might depend on it.

A Thought Experiment

Imagine that you are trying to build artificial general intelligence (AGI). You are very well funded. You have a team of engineers and data scientists who are working around the clock. You are willing to do whatever it takes to build the best training dataset in the world.

You believe that achieving AGI is the most important thing that you can do for humanity. You are therefore willing to break some rules and conventions to make it happen, because the reward far outweighs the risks.

What would you do? Perhaps you would:

  • Mine data from sites as much as possible and as quickly as possible before they implement controls to stop you or slow you down
  • Create AI agents that browse the web like a human would, passing through account registrations, login screens, forms, captchas, and other barriers which block your access to more data
  • Use AI to make your crawler super efficient and advanced and able to adjust crawling strategy at any time
  • Make your crawler appear as human-like as possible to avoid detection
  • Ignore robots.txt and other rules that sites have set up to prevent you from scraping their data
  • Use AI to discover security bypasses and vulnerabilities in websites to get more data
  • Buy data from data aggregators
  • Buy data from the dark web / hackers
  • Hack into servers to get data

Now imagine you are a black hat hacker with the goal of using AI to make ill-gotten gains. You are willing to break any law or rule, and AI empowers you more than ever.

On the flip side, imagine you are the CEO of a reputable organization with valuable data on your site. What would it mean for your organization if an AGI were trained on all of your data, and could recite and reason about that data faster, more accurately, and less expensively than any human could, all via a simple and cheap API call? For many organizations, this requires a rethink of their business model, and fast.

Notable Content Subjects

Certain subject areas get special treatment in the world of search and AI. These are areas where good information is particularly valuable because the stakes are high, and where errors / omissions can have serious consequences. At a high level they are:

  • Health and medicine
  • Personal finance
  • Legal
  • Safety and security
  • Education
  • News and current events
  • Employment and careers
  • Housing and real estate
  • Travel

If your organization falls into one of these categories, understand that your data is particularly valuable, and that your future business model should leverage your data.

This article put it well:

[OpenAI's chief architect] urged companies to differentiate by using language AI APIs and creating unique user experiences, data approaches and model customizations.

... the key differentiator for businesses building language model-powered services is leveraging your own proprietary data.

“The user experience you create, the data you bring to the model and how you customize it and the like service that you expose to the model, that is actually where you folks are going to differentiate and build something like genuinely unique,” Jarvis said. “If you just build a wrapper around one of these very useful models, then you're no different than your competitors.”

The Everything API

Chat-based AI systems are effectively API endpoints which perform arbitrary tasks using plain language on a pay-as-you-go basis. One of the most fascinating realizations for me was that chat-based AI systems turn any arbitrary content (such as the entire web itself) into a fully customizable API. In this new world, EVERYTHING is an API: websites, PDFs, images, videos, etc.

Further, the outputs can be formatted in any desired data structure such as JSON, XML, etc., which makes it easy to integrate with any system. All of this can be done very quickly with minimal technical knowledge.

This means that your Sitecore site is now also an API.

The Changing Role of DXP Implementation Partners

There has been much talk of implementation partners migrating their clients to the cloud, going headless / composable, and so on; however, much of this is just a rehash of the same old stuff:

  • Distributed / composable vs monolithic
  • Reliability
  • Scalability
  • Performance
  • Cost reduction
  • Revenue generation
  • Risk mitigation
  • Security

All of the above are important, but they are not new. The new stuff is how to operate and market in a world where AI is the new normal.

A significant source of work and revenue for implementation partners is going to be:

  • Making client data discoverable and queryable by AI systems
  • Training proprietary AI on proprietary data
  • Data protection and security
  • Discovery and implementation of AI tools for both internal and external purposes such as translation, content generation, search, etc.
  • General consultation for how to navigate the new AI-driven world
  • Integration between systems
  • Cost reduction using AI
  • Revenue growth using AI

All of which is to say that implementation partners, just like their clients, will need to adapt to the new AI-driven world.

Takeaways

  • Assume that web crawlers are AI-driven. They are not just simple bots that crawl the web and index pages. They are AI systems that are trained to understand and reason about the data that they are collecting. They are able to make inferences and draw conclusions from the data that they collect. They are able to learn from that data and improve their performance over time.
  • Assume that web crawlers may not respect the rules that you have set up for them.
  • Ask your web team if they are monitoring and managing these crawlers.
  • Implement WAF, bot detection, rate limiting, and monitoring.
  • If your WAF is showing periods of increased traffic, investigate it, even if it is not causing any issues.
  • Monitor logs for unusual activity, using AI where appropriate.
  • Monitor your bandwidth usage.
  • Consider developing a technical and legal framework around the use of your data.
  • Think about how your organization will need to evolve in a world where AI is the new normal.
  • Think about training your own AI on your data.
  • Think hard about what aspects of your business cannot be made obsolete by AI, or in fact would be improved by AI augmentation. Lean into those.
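
As a starting point for checking whether crawlers respect your rules, here is a minimal Python sketch that compares logged requests against a robots.txt file using the standard library's `urllib.robotparser`. The robots.txt content, the `SomeBrowser` name, and the request list are hypothetical examples of mine; in practice you would load your site's real robots.txt and feed in (user agent token, path) pairs pulled from your own logs.

```python
from typing import Iterable, List, Tuple
from urllib.robotparser import RobotFileParser

# HYPOTHETICAL robots.txt for illustration; use the one your site publishes.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""

Request = Tuple[str, str]  # (user agent product token, requested path)


def find_violations(robots_txt: str, requests: Iterable[Request]) -> List[Request]:
    """Return the requests that the given robots.txt rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() returns False when the rules forbid this agent/path pair,
    # so a logged request for which it is False is a violation.
    return [(ua, path) for ua, path in requests if not parser.can_fetch(ua, path)]


if __name__ == "__main__":
    # Made-up requests standing in for entries extracted from access logs.
    logged = [
        ("ClaudeBot", "/articles/a"),
        ("SomeBrowser", "/articles/a"),
        ("SomeBrowser", "/private/report"),
    ]
    for ua, path in find_violations(ROBOTS_TXT, logged):
        print(f"{ua} requested disallowed path {path}")
```

One design note: `can_fetch` matches on the product token (for example `ClaudeBot`), not on a full browser-style user agent string, so extract the token from the log entry before calling it. A crawler that spoofs its user agent will slip past this check entirely, which is why it complements a WAF and rate limiting and does not replace them.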

Stay.... Intelligent.

-MG

