What is Duplicate Content?
Simply put, duplicate content is a block of text on a webpage that also appears on other webpages on the internet. The duplicated block can be as small as a sentence or two, or as large as an entire page, perhaps thousands of words of recycled content.
From the perspective of a digital marketer or website owner, there are two main types of duplicated content.
The first is content that’s duplicated within a single website, what many would refer to as “internal” duplicate content. Although this content isn’t found on any other website and is unique to that one specific domain, it’s published to multiple pages on that domain.
The second type of duplicate content is content that is duplicated across multiple domains on the internet, what many would refer to as “external” duplicate content.
Although some web pages (and domains!) are entirely “scraped”, or intentionally populated with recycled content, most instances of duplicate content are a little more benign. This could include common forms of duplicate content such as manufacturer-provided product descriptions, boilerplate shipping information, or return policies. It could also include highly “templated” content styles where much of the content remains the same, except for specific SEO-related variables (think landing pages where only the location name or service type differs).
How Does Google Identify Duplicate Content?
Google can identify duplicated content in a variety of ways that all tie back to how it fundamentally crawls, renders, and caches URLs across the web. One of those ways is something called “block-level analysis”, a process where Google breaks down the actual HTML source code used to create a web page and classifies pages into discrete “blocks” of code. Google can track exactly how many pages, across how many websites, contain any given block, and it knows where on the page that block of code appears when it’s fully rendered.
Based on that information, Google can then classify and weight the impact of these blocks by their overall relationship to one another. So the header of a webpage carries more weight than the footer of that same page, and the body content between the two carries even more. In the same way, Google places far more emphasis and “value” on truly unique blocks of content, and significantly less on blocks of content that appear elsewhere, whether on other pages within a single website or across multiple websites.
The more pervasive the duplication, the less weight that block of content carries on any given page. The very first instance of a block that Google encounters is given a certain level of value, but every additional instance of that block is given progressively less.
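Google’s actual block-level system is proprietary, but the idea can be sketched in a few lines of Python: count how many pages a normalized block appears on, then dilute its weight accordingly. This is a toy illustration of the concept, not Google’s real scoring:

```python
from collections import Counter

def normalize(block: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting
    differences don't hide a duplicated block."""
    return " ".join(block.lower().split())

def block_counts(pages: dict[str, list[str]]) -> Counter:
    """Count how many pages each normalized block appears on."""
    counts = Counter()
    for blocks in pages.values():
        # De-duplicate within a page so a block counts once per page.
        for block in set(normalize(b) for b in blocks):
            counts[block] += 1
    return counts

def block_weight(block: str, counts: Counter) -> float:
    """Toy weighting: a unique block gets full weight; each
    additional page it appears on dilutes that weight."""
    return 1.0 / counts[normalize(block)]

pages = {
    "/a": ["Welcome to our store", "Free shipping on all orders"],
    "/b": ["Our custom widget guide", "Free shipping on all orders"],
}
counts = block_counts(pages)
print(block_weight("Welcome to our store", counts))         # → 1.0 (unique)
print(block_weight("Free shipping on all orders", counts))  # → 0.5 (on 2 pages)
```

Real systems use far more sophisticated fingerprinting, but the intuition is the same: the more places a block appears, the less any single copy is worth.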
How Much Content is Duplicated?
Much more than you’d think! As early as 2013-2014, Google engineer Matt Cutts indicated that as much as 25% to 30% of ALL content that’s on the web is duplicated somewhere else at least once.
A few years later, a Raven Tools study based on their Site Auditor data found much the same: just under 30% of pages featured duplicated content.
Is Duplicate Content Bad?
Here is where many of the misunderstandings begin to arise. Because, as frustrating as this sounds, it depends. Context matters, and this is a nuanced topic. Ready to dig into a little history? Here goes!
Prior to 2011, Google had a massive search results problem. Besides highly ranked but low-quality content farms, like eHow, there were also pervasive article spinning and website scraping abuses that were rife throughout the search results. To combat this, Google released the Panda Update in early 2011, which turned the industry on its head. Sites that had unfairly gamed the search algorithms with massive swaths of thin or duplicated pages saw their visibility disappear nearly overnight.
Today, Panda is more generally appreciated as an overall “content quality” update, but initially, duplicate content farms were perceived as a prime target for Panda’s ranking adjustments. Now, you will rarely (thankfully!) see a website that ranks well with pre-Panda levels of duplicate content. But that doesn’t mean all duplicate content is gone. Remember, as much as 30% of pages will have at least some form of repurposed content block.
So, what is the status of these pages? It depends on how much unique content exists alongside the duplicated content. The more relevant, valuable unique content there is to balance out the duplicated content, the greater the chances will be for that page to rank, all other things being equal.
This means pages that include the occasional boilerplate warranty information or manufacturer-provided product specs may still have enough unique content on the page to be deemed relevant and valuable to search queries. But pages that primarily contain duplicated content and have little to no unique material are perceived by Google as providing little to no “value-add” for their searchers, and are not likely to rank well.
Is there a Duplicate Content Penalty?
This might be the biggest myth that’s still got legs. No, there is not a duplicate content “penalty”. There never was.
So where does this myth come from? Generally speaking, it goes back to Panda. Because the Panda algorithm updated ranking factors to reward more unique and relevant content that hadn’t previously ranked on Page 1 of Google, the lower quality pages that had been previously ranking were leapfrogged.
When sites that were previously less visible saw ranking improvements, it came at the expense of other sites. The websites who lost rankings were not explicitly penalized, per se. But they failed to rank as well for as many queries as they had previously ranked for, because they lacked unique, valuable content for those terms. It was these sites that saw massive ranking drops that they then perceived (and communicated) as “penalties”.
While perception often becomes reality for organizations, it’s important to recognize the difference between changes in content valuations and actual “penalties”. Truly penalized sites were, and are, essentially deindexed entirely from Google. They just disappear. But websites impacted by Panda didn’t entirely disappear. They just lost a great deal of prior visibility. Some sites hit hard by Panda made the good faith efforts to invest in and change their overall content uniqueness and quality. Those sites recovered rankings soon enough. But most sites that were hit by Panda didn’t make such changes and were relegated to the dustbin of search engine history. Although their fate is often described as falling to a penalty, that’s not an apt description of what happened to them.
How much is too much duplicate content?
Ok, so if there’s no duplicate content penalty, but duplicate content can still be “bad” in large enough quantities, how much is too much? Again, that depends on context. One largely overlooked factor is the industry/sector benchmarks Google has at its disposal. To the degree that many other websites and competitors can create 100% unique content for their pages, the more that expectation will be baked into overall visibility requirements for a website. Where an industry or sector finds itself needing to use templated or boilerplate content frequently, the content quality benchmarks might become a little more forgiving. As a general rule, however, there should be more unique content on a webpage than duplicate content if that page is going to have consistent visibility in search engines. Within any given website, page types or sections that feature more unique content are more likely to rank well than pages that feature less, all things being equal.
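As a rough illustration of that “more unique than duplicate” rule of thumb, here’s a hypothetical Python sketch that measures what fraction of a page’s word count comes from blocks an audit tool has flagged as duplicated. The function and the flagged set are illustrative assumptions, not anything Google publishes:

```python
def unique_ratio(page_blocks: list[str], duplicated_blocks: set[str]) -> float:
    """Fraction of a page's content (by word count) that is unique.
    `duplicated_blocks` is whatever set of blocks your audit tool
    flagged as appearing elsewhere."""
    total = sum(len(b.split()) for b in page_blocks)
    dup = sum(len(b.split()) for b in page_blocks if b in duplicated_blocks)
    return (total - dup) / total if total else 0.0

blocks = [
    "Our hand-written buying guide for trail shoes",   # unique copy
    "Free shipping on all orders over $50",            # boilerplate
]
print(unique_ratio(blocks, {"Free shipping on all orders over $50"}))  # → 0.5
```

A page scoring below 0.5 on a measure like this would be mostly duplicate, which is exactly the situation the rule of thumb warns against.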
How Can We Handle Duplicate Content?
Digital marketers and webmasters who come across duplicate content on their websites thankfully have a range of options at their disposal. These generally fall into one of three categories: markup, redirects, or content editing.
Many pages that technically include duplicate content also have parameter-appended URL strings, such as https://www.example.com/page?sort=hi-lo or https://www.example.com/page?utm_source=facebook. These pages actually “should” have duplicate content, because they are the same page, although not the same URL. Since it’s only the parameters that make these pages “different”, the parameters need to be handled, not the duplicate content.
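Handling parameters typically means a canonical tag that points every parameterized variant back at the clean URL. Using the example URLs above, a sketch of the markup would be:

```html
<!-- In the <head> of https://www.example.com/page?sort=hi-lo (and any
     other parameterized variant), point search engines at the clean URL: -->
<link rel="canonical" href="https://www.example.com/page" />
```

With this in place, Google consolidates signals from the parameterized variants onto the canonical URL rather than treating them as competing duplicates.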
Canonical markup can also be used to resolve duplicate pages that don’t involve parameters. Sometimes features like faceted navigation can create different URL paths that lead to the same page of content:
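For instance (the URL paths here are hypothetical), two faceted paths that render the same product listing can both declare a single canonical:

```html
<!-- Both https://www.example.com/shoes/red/ and
     https://www.example.com/red-items/shoes/ render the same listing,
     so both carry the same canonical tag: -->
<link rel="canonical" href="https://www.example.com/shoes/red/" />
```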
Or a canonical may be used when websites publish content, which already exists, on another domain, which would be an acceptable use of cross-domain canonicalization:
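In that scenario, the republishing site’s copy carries a canonical pointing back at the original article (the domains here are hypothetical):

```html
<!-- On the syndicating site's copy of the article, credit the
     original publisher with a cross-domain canonical: -->
<link rel="canonical" href="https://www.original-publisher.com/article" />
```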
Finally, in terms of markup solutions, there is always the noindex directive. Remember, not every page on a website needs to be crawled and indexed by Google. For example, if multiple versions of a landing page exist to better track and attribute conversions and visitors by traffic source, any that don’t pertain directly to organic search can be noindexed.
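In practice, noindexing a page means adding a robots meta tag to that page’s head:

```html
<!-- In the <head> of a tracking variant that shouldn't
     appear in search results: -->
<meta name="robots" content="noindex" />
```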
Suffice it to say, if a page offers little unique content for search engine users and content editing is deemed prohibitively expensive or resource-intensive within an organization, then the page probably shouldn’t be indexed by Google at all. Utilize source code markup like the robots noindex meta tag, or leverage the site’s robots.txt file to identify pages that should not be crawled (keep in mind that robots.txt blocks crawling, not indexing, so noindex is the more direct tool for keeping a page out of search results).
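A robots.txt rule looks like this (the path is a hypothetical example); note that it prevents crawling rather than indexing:

```text
# robots.txt at the site root — keeps compliant crawlers out of a
# section of tracking variants. A page blocked here can still be
# indexed if other sites link to it.
User-agent: *
Disallow: /landing-pages/ppc/
```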
If parameter handling or canonical markup aren’t ideal options, 301 redirects are also an SEO-friendly alternative. Take the duplicated page and modify server-side configuration, such as an .htaccess file, to redirect the duplicate page to the authoritative version.
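On an Apache server, the .htaccess rule might look like this (the paths are hypothetical):

```apache
# .htaccess — permanently redirect the duplicate page
# to the authoritative version with a 301.
Redirect 301 /old-duplicate-page/ https://www.example.com/authoritative-page/
```

A 301 tells search engines the move is permanent, so ranking signals consolidate on the destination URL.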
If neither markup nor server-side redirects are an option, then it may be time for website stakeholders to actively edit and rewrite duplicate content to make their pages more unique. The content cycles for such an initiative can be prioritized by estimated visibility improvements or ROI and laid out like any other editorial calendar.
By now, you should have a much better understanding of just what duplicate content is, why it can be, but isn’t necessarily, problematic for a website’s organic search visibility and how to address instances of it on your own websites. By honestly assessing your own duplicate content issues, and the context in which they exist, you will sleep much more soundly at night. Especially now that you’re not kept awake by nightmares of duplicate content “penalties”.