Duplicate content is a prevalent and serious issue for many websites.
It can damage a site's organic health and even prevent web pages from ranking, or being indexed at all, in the search results, whether you run a small blog or a large ecommerce site.
Read our latest guide to learn more about duplicate content, why it's harmful for SEO, and how to find and resolve duplicate content issues that may be plaguing your website.
Duplicate content is content that appears in more than one place on the Internet. A "place" here means a location with a unique website address (URL); thus, duplicate content occurs when the same content appears at multiple web addresses.
While duplicate content doesn't technically incur a penalty, it can still affect search engine rankings. When there are multiple pieces of "appreciably similar" content, as Google describes it, in more than one location on the Internet, it can be difficult for search engines to determine which version is more relevant to a given search query.
The process of eliminating duplicate content isn't complicated, but it does require commitment. It takes a concerted effort to make your content stand out from the pack; if you lean on shortcut tactics to improve your website's visibility, you'll end up with a stale website that's full of duplicate content.
Google filters for similar material on your website: if it finds duplicate content, your web pages' rankings may decline.
Google’s search crawlers can get confused when trying to determine which duplicate content or web pages should rank for a given query. This can cause what is known as keyword cannibalization, resulting in decreased rankings for that given query (or in some cases Google may elect to rank none of your duplicate content).
If there are multiple versions of the same page, this can divide link equity (both from backlinks and internal linking), causing your web pages to rank less effectively than if that equity were consolidated into a single version of the page.
For duplicate content from external websites, Google will most likely not index your web pages because that information already exists on the web, which means you won’t have the ability to rank or drive organic traffic to the site for those duplicated pages.
Simply put, Google doesn't issue manual penalties for duplicate content. However, it may choose not to index pages that consist largely of duplicate content, which can feel like a penalty in practice.
Here is Google’s take on duplicate content:
“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.”
Read our latest guide to learn more about search engine optimization basics that you should know about for your website.
Below we’ll talk through the most common issues that create duplicate content on a website:
The most obvious cause is copying content from external sources on the web. Google can easily identify which websites content was copied from and determine whether your web page should be indexed in its search results.
Note: copying content, or quoting sources, isn't inherently a bad thing. It's when more than 50% of your web page is copied content that issues arise with your ability to rank for your target phrases in the search results.
If you have multiple versions of a single web page, and those web pages don’t have canonical tags, this can cause mixed signals to Google when determining which version of that web page should rank.
In a nutshell, canonical tags tell Google which version of a page it should pay attention to, while ignoring the other variants. So if you have versions A, B, C, & D, you may want to place a self-referencing canonical tag on version A, and point canonical tags from versions B, C, & D to version A.
This tells Google “hey, pay attention to version A and pass all the link equity to this page, and please ignore versions B, C, & D.”
If canonical tags aren’t in place, Google won’t know which of these versions it should be indexing and it may choose to ignore your desired version of the web page.
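As a rough sketch of the version A/B/C/D scenario above (example.com and the URL paths are placeholders), the canonical tags would look like this:

```html
<!-- In the <head> of version A (https://example.com/page-a): a self-referencing canonical -->
<link rel="canonical" href="https://example.com/page-a" />

<!-- In the <head> of versions B, C, and D (e.g. https://example.com/page-b):
     the same tag, pointing back to version A -->
<link rel="canonical" href="https://example.com/page-a" />
```

Note that the tag itself is identical on every version; what matters is that all variants point at the single preferred URL.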
URL variants are another cause for duplicate content issues. These are formed when you have multiple versions of a web page (such as when you perform A/B testing); when UTM parameters are appended to your URLs; and when you have multiple variants of the URL itself.
For the last item, an example would be the same page served at two slightly different addresses, such as https://example.com/page and https://example.com/page/ (with and without a trailing slash).
Google considers both of these versions to be unique URLs, rather than the same page, and can get confused as to which variant it should be crawling and indexing. A solution to URL variants is to ensure that 301 redirects are in place for all versions of your URLs that point to your preferred URL path.
Read our latest guide for best practices when creating SEO friendly URLs for your website.
When dealing with localized versions of content on a site that are translated into different languages, Google can get confused and consider these pages to be duplicative.
The preferred solution to show the relationship of these pages is to use what are known as hreflang tags.
If hreflang tags aren’t present, Google may flag your content as duplicate or get confused as to which one it should be ranking in the search results.
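As a sketch of what this looks like in practice (example.com and the language paths are placeholders), each language version lists all of its alternates in its head, including itself:

```html
<!-- In the <head> of every language version of the page -->
<link rel="alternate" hreflang="en" href="https://example.com/en/page/" />
<link rel="alternate" hreflang="de" href="https://example.com/de/page/" />
<!-- x-default tells Google which version to show when no language matches -->
<link rel="alternate" hreflang="x-default" href="https://example.com/en/page/" />
```

The same set of tags should appear on every version so the annotations are reciprocal.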
Syndicated content can pose a big issue for websites that send out a lot of press releases or articles that get picked up by other publications in a syndication cycle.
Because the content is 100% duplicated, and several publications may republish your article, Google may not recognize your website as the original publisher, creating the possibility that one of the republishers will rank for that content instead of your own website.
For syndicated content, canonical tags need to be in place to show Google that your content is the “master copy” and to ignore all of the other syndicated content that’s live on other publisher websites.
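In this case the canonical tag is cross-domain: it sits on the republisher's page and points back to your original. A minimal sketch (both domains are placeholders):

```html
<!-- In the <head> of the republished article on publisher-site.com,
     pointing back to the original article on your own domain -->
<link rel="canonical" href="https://example.com/original-article/" />
```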
Lastly, duplicate content is prevalent on e-commerce websites that have thousands of product pages.
While products may have slight variations, if you use the same product descriptions, this will be flagged as duplicate content. Google may deem these product pages to not be valuable for users in the search results and choose to not index or rank any of your product pages because they are all using the same product descriptions. It can also impact your website’s crawl budget and prevent new pages from being discovered.
To fix this, it's best practice to include unique or dynamically generated content for all of your product pages to allow them the chance to rank effectively for their target keywords. If the descriptions aren't unique from one another, those pages may not rank or be indexed at all.
Now that we’ve walked through the most common causes of duplicate content on a website, let’s discuss how to find duplicate content.
One of the best ways to find duplicate content is to conduct a Screaming Frog crawl.
You also have the ability to run a crawl for near-duplicate content, using a similarity threshold slider (1-100%) to find near matches between your site pages.
Google Search Console houses an index coverage report that will surface duplicate content issues on a website, such as "Duplicate without user-selected canonical" and "Duplicate, Google chose different canonical than user."
If you're looking for duplicate content from external websites, there are several plagiarism-checking tools available. One of the best is Copyscape, which is free to use. It can also be used to check content of up to 1,000 words for plagiarism before it's published on your website.
Below we’ll walk through how to fix internal duplicate content issues on your website.
First, you can simply implement 301 redirects for any URL variants or versions of your website that you don’t want Google crawling and indexing.
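On an Apache server, for example, redirecting common URL variants to a single preferred version might look like the following .htaccess sketch (example.com is a placeholder, and the exact rules depend on your hosting setup):

```apache
RewriteEngine On

# Redirect the non-www host to the www host with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]

# Redirect HTTP requests to HTTPS
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://www.example.com/$1 [L,R=301]
```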
When several pages with good ranking potential are merged into a single page, they not only avoid competing with each other, but they also send a stronger relevance and visibility signal overall. This improves the potential of the "right" page to rank highly.
Like I mentioned previously, canonical tags are another option for handling duplicate content issues when you have multiple page variants at play, but you don't want to implement something like a 301 redirect. Again, a canonical tag will tell Google to treat a variant of a web page as a "copy" and to pass that ranking authority to your preferred version of the web page, while still allowing both variants to be accessible to users and search engines.
Noindex tags signal to Google's search crawlers that a page should be ignored and left out of its search engine results.
This is valuable for pages that may contain thin content that isn’t valuable to search engines, but you still want users to access. This can also help in instances such as pagination of articles or blog posts.
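A noindex directive is a single meta tag; a minimal sketch of what it looks like:

```html
<!-- In the <head> of a page that should stay accessible to users but out of the index.
     "follow" tells crawlers they may still follow the page's links. -->
<meta name="robots" content="noindex, follow" />
```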
You can set up specific directives for Google’s crawl bot via Search Console on how to handle URL parameters. This approach is especially helpful for websites that use a lot of UTM parameters, or parameterized URLs to filter results (like on e-commerce sites).
The biggest disadvantage of using parameter settings in Search Console is that the directives you set only apply to Google. The rules you set up in Google Search Console have little impact on how Bing's or any other search engine's crawlers view the site; you'll need to use the equivalent webmaster software, like Bing Webmaster Tools or Yandex Webmaster, for those search engines in addition to the Search Console settings.