How Google's search dominance threatens publishers in the AI age

The convenient artificial intelligence answers that Google now places at the top of its search results (in some markets) come at the expense of the websites that users would otherwise visit. But many website owners say they can’t afford to stop Google’s AI from summarizing their content.

That’s because the Google tool that crawls web content to generate its AI answers is the same one that indexes web pages for search results, the publishers say. Blocking Google in the same way that websites have blocked some of its AI competitors would also hurt a site’s ability to be found online.

Google’s dominance in search – which a US court last week ruled was an illegal monopoly – gives the company a decisive advantage in the burgeoning artificial intelligence war, which search engine startups and publishers say is unfair as the industry takes shape. The dilemma is particularly acute for publishers, who face the choice of either submitting their content to AI models that could render their websites obsolete or disappearing from Google search, one of their most important sources of traffic.

“This is becoming an existential crisis for these companies,” said Joe Ragazzo, editor of the news site Talking Points Memo. “Those are two bad options. You either get out and die instantly, or you partner with them and probably die slowly because they don’t need you anymore.”

Google says AI Overviews – the summaries that appear at the top of Google Search – are part of its long-standing commitment to providing higher quality information and improving opportunities for publishers and other businesses. “Every day, Google sends billions of clicks to sites across the web, and we intend to continue this long-established value exchange with sites,” a Google spokesperson said in a statement. “With AI Overviews, people find search more useful and they return to search more often, creating new opportunities for content discovery.”

Googlebot

Since its inception, Google has used software called Googlebot to visit, or “crawl,” millions of websites, creating a detailed index of the global Internet. Over the years, this index has presented a formidable barrier to entry for companies trying to build competing search engines – even those with deep pockets like Microsoft.

The rise of generative AI has sparked a new wave of startups seeking to offer search products that use AI models to provide succinct answers to users’ questions. The popularity of chatbots has caused panic at Google about the future of its search engine, which for a long time seemed invincible. But before these startups can really put a dent in the search giant, they need to crawl the web. And that’s no easy task.

Crawling costs website owners money, processing power and storage space, so many publishers include a file, known as robots.txt, that sets rules for bots visiting their sites. The companies given the most leeway are usually Google and Microsoft’s Bing, which can drive traffic to websites through their search engines.
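As a rough illustration of how such a rules file works, the sketch below parses a hypothetical robots.txt with Python’s standard-library parser and checks which bots it admits. The file contents, the example.com address and the bot name ExampleStartupBot are assumptions made up for this example, not taken from any real publisher.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the two big search engines are welcome,
# every other crawler is turned away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_bots(robots_txt, bots, url):
    """Return a dict mapping each bot name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# ExampleStartupBot is an invented name standing in for a new search startup.
result = allowed_bots(
    ROBOTS_TXT,
    ["Googlebot", "bingbot", "ExampleStartupBot"],
    "https://example.com/articles/1",
)
print(result)
```

Under these assumed rules the two established crawlers are admitted and the newcomer is not. The file is only a request to well-behaved bots, so it shows what a publisher asks for rather than what it can enforce.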

But search engine startups can’t promise such traffic before they even gain a foothold. That’s one reason the young companies have started making deals where they pay publishers to license content, says Alex Rosenberg, CEO of AI startup Tako.

“Today, there are a number of technology companies that pay for content, they pay for access to it because they need it to be able to seriously compete,” Rosenberg said. “Google, on the other hand, doesn’t really have to do that.”

In a wave of deals between media companies and AI startups, Google is a conspicuous outsider. With the exception of a reported $60 million deal with Reddit, Google has signaled to publishers behind closed doors that it is not interested in negotiations, say two people familiar with the matter who asked not to be identified because the information is confidential.

Media companies have little say in these conversations. Earlier this year, Google introduced AI Overviews, in which the company uses AI to provide succinct answers to some of users’ questions at the top of the search page. Publishers were immediately concerned about the impact the answers could have on their traffic, but had no clear way to allay those fears.

Google uses a separate crawler for some AI products, such as the Gemini chatbot. But its main crawler, Googlebot, serves both AI Overviews and Google Search. A company spokesperson said Googlebot governs AI Overviews because AI and the company’s search engine are closely intertwined. The spokesperson added that the search results page displays information in a variety of formats, including images and graphics. Google also said publishers can block certain pages or parts of pages from appearing in AI Overviews in search results, but that would likely also prevent those snippets from appearing in all of Google’s other search features, including web listings.

Many publishers, who often rely on search engines for at least half of their traffic, are unwilling to take the risk of reducing their reach.

Google’s position “underestimates the significant risk this poses to content creators, particularly those who rely on visibility in search results for their livelihood,” said Marc McCollum, head of innovation at Raptive, which represents publishers and influencers. “By opting out, creators may inadvertently reduce their overall search presence, which could impact their ability to reach their audience and generate revenue.”

Kyle Wiens, CEO of iFixit, a website that publishes free online repair guides for consumer electronics, said the site’s relationship with Google is “much more fragile” than with other AI companies. “I can block ClaudeBot from indexing us without hurting our business,” Wiens wrote in an email, referring to the bot from generative AI startup Anthropic. “But if I block Googlebot, we lose traffic and customers.”
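The asymmetry Wiens describes can be sketched with the same kind of rules file. The example below assumes the publicly documented crawler tokens ClaudeBot (Anthropic) and Google-Extended (Google’s opt-out token for Gemini); the robots.txt contents and the example.com page are hypothetical and do not reflect iFixit’s actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher that opts out of AI-only
# crawlers but cannot afford to block Google's main crawler.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Made-up page URL used only for the check below.
page = "https://example.com/guides/battery-replacement"
access = {
    bot: parser.can_fetch(bot, page)
    for bot in ("ClaudeBot", "Google-Extended", "Googlebot")
}
print(access)
```

In this sketch the two AI-specific tokens are refused while Googlebot is still admitted. Because Googlebot feeds both Search and AI Overviews, as the article describes, a file like this has no line that opts out of one without the other.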

Out of reach

Google’s deal with Reddit, where millions of users engage in heated debates about niche topics, provides the company with a treasure trove of information for AI models. The deal coincided with changes Google made to increase the presence of results from forums like Reddit in search results, leading to a huge increase in traffic to the social media site. A Reddit spokesperson said improvements in product quality and speed also contributed to the increase in traffic.

Search engine startup Perplexity is currently negotiating with Reddit to license content, but the Google deal has set a price that is difficult for a startup to match, according to a person familiar with the matter. Google said the deal with Reddit is a wide-ranging partnership that includes more than just training data. Reddit’s spokesperson declined to comment on the business talks. Perplexity declined to comment.

Other search engine startups have concluded that the data is simply out of reach.

“It would take us 20 years of our current revenue just to pay for Reddit,” says Vladimir Prelovac, founder of search engine startup Kagi. “It’s not even a possibility I’m considering.”

Small startups are not alone in their problems. OpenAI recently launched SearchGPT, a test version of its hugely popular chatbot tailored for search. But popular websites like Amazon, Goodreads and Uniqlo have blocked OpenAI’s GPTBot crawler from their sites, according to public documentation, potentially spelling trouble for OpenAI’s search ambitions. OpenAI has stated that websites can appear in its search results even if they choose to exclude their content from AI training.

Prelovac said at least half of Kagi’s costs go toward crawling and other sources of search data. A comprehensive index of the web is a must for a search engine to give users a detailed look at what’s on the internet. But for companies that want to answer users’ questions directly using AI, a model popularized by ChatGPT, the data is even more important, Prelovac said.

“Generative AI models alone are not very intelligent,” said Prelovac. “To get high-quality generative AI results, you need to have access to the same search index.”

The ubiquity of robots.txt files, which set guidelines for crawling, forces startups to make complex decisions, says Richard Socher, founder of search engine startup You.com. The files are not legally binding, so companies are allowed to crawl public data as long as no login or subscriber information is required, says Socher.

“When we crawl, we try not to place undue stress on any one site,” he said. “Any site that has a robots.txt file that only Google is allowed to crawl and no one else is essentially supporting a Google search monopoly.”

Neeva, a search startup founded by former Google employees that was acquired by Snowflake last year, argued for “crawl neutrality” to make it easier for startups to build their search indexes. Following a landmark ruling by a US court finding that Google monopolizes the online search market, the US Department of Justice is considering taking remedial measures, including forcing the search giant to share more data with rivals, or even breaking up the company. One proposal that has attracted considerable attention is to require Google to share the data it collects through Googlebot, or open up its famous search index to its competitors. The EU’s Digital Markets Act already requires Google to share some search query data.

For Wiens, CEO of iFixit, the advantage Google has over other AI companies because of its search empire is at the heart of the company’s antitrust problems. “Separating Google Search from its AI work,” he said, “would defuse the conflicts.”

Problematic

Search engine DuckDuckGo said recent technological changes in search “make Google’s index even more problematic in relation to antitrust concerns.”

“Search indexes are extremely important in the age of generative AI,” said Kamyl Bazbaz, senior vice president of public affairs at DuckDuckGo.

Regardless of the outcome of the antitrust case, the ongoing changes in the search landscape underscore how important it is for publishers to control their own destinies and not rely too heavily on any single technology platform – including Google, says TPM’s Ragazzo.

“We believe that you have to build a real relationship with your readers,” says Ragazzo. “This is how you create a publication that can transcend different eras.”

Julia Love and Davey Alba, with Leah Nylen and Shirin Ghaffary, (c) 2024 Bloomberg LP