Unmanaged duplicate content is, in my opinion, one of the most detrimental search engine optimization issues for a website, with the potential to significantly impact your rankings and organic performance.
If you've been involved in digital marketing for a while, you've most likely heard of "duplicate content", perhaps from your company's SEO team, content marketers, or a partner SEO agency. You may also have listened to an explanation and feel like you have a basic understanding of what duplicate content entails.
Over the past few years, I've read, watched, and heard a plethora of different explanations of duplicate content, from SEO forums to social media posts to professional agency blog posts. There are many cases – especially since 2013 – where sites were launched with issues that were never identified, and therefore never reached their potential. As a result, I can't help but think that many people (including professional SEOs) don't fully understand what duplicate content is and how it can impact your online presence.
Given the potential impact, it's surprising that there's so much misinformation about what duplicate content is and how to fix it. In this article, I'll explain:
- What is duplicate content?
- How does duplicate content happen?
- How to manage duplicate content?
Related: SEO migration from A to Z
What is duplicate content?
“Your website appears to contain large amounts of duplicate content”.
“But we wrote all the content ourselves!?”
The first obstacle to overcome is language; more often than not, people associate duplicate content with plagiarism. This is not the case.
There are two categories of duplicate content:
- internal duplicate content (on-site)
- external duplicate content (off-site)
Parallels can be drawn between off-site duplicate content issues and plagiarism, although that is not a technical problem you can fully control.
The causes, impacts, and solutions associated with each type are entirely different, and trust me, internal duplicate content is the worst case scenario! It is this category that I will address in this guide.
By my definition (which I may have read somewhere or made up), "internal duplicate content" is a technical SEO issue caused by the way a website is built. It occurs when a specific web page is rendered at several different URLs. It is not content that has been stolen, reused, or lifted from elsewhere on the web or from your own website.
The thing to know is that almost every website driven by a content management system (CMS) produces duplicate content – the question is whether or not it is managed properly.
The simplest example is your home page. A home page may appear when you type example.com or www.example.com. In this case, the same content is rendered on two different URLs, which means that one of them is a duplicate of the other.
This is only a problem if search engines are able to crawl the duplicates. That said, never underestimate Googlebot's ability to find things; it usually gets a nudge from somewhere, like an XML (or HTML) sitemap or a misconfigured CMS link. When Google sends you over 50% of your online customers, it's worth taking precautions.
So why worry about internal duplicate content?
Don't worry, but be aware of it. Google's index is entirely URL-based. When the same page is rendered by two different URLs, there is no clear indication as to which page is correct. As a result, neither page ranks as well as it should in the SERPs.
Also, in May 2012, among a series of updates, Google included tougher penalties for duplicate content as part of its Panda 3.4 update. I was lucky enough to work on a site at the time that was heavily penalized following the update, and I quickly learned how to deal with duplicate content penalties.
It's worth mentioning at this point that unlike Penguin's link penalties, duplicate content penalties can be removed very quickly by taking the right steps. In my experience, you don't need to wait for an update from Panda.
Duplicate Content Signs
Duplicate content problems can surface in a number of situations, but they most often appear at the time of a Panda update, after the launch of a new website, or after changes to a site where duplicate content management was implemented incorrectly (or not at all). Rankings and traffic start to drop, but the impact depends on the severity of the problem.
If you're familiar with duplicate content, you'll be able to find it by performing manual checks on a site, but for a quick check you can do a site search in Google (site:yourdomain.com). If, on the last page of the results, Google tells you it has omitted some entries "very similar" to those already displayed, the content may be duplicated. You'll have to do more research to be sure.
How does duplicate content occur?
Home page duplicates
As I mentioned at the beginning, one of the most common cases of duplicate content on every website is duplication between the www subdomain and the non-www root domain.
For example:
- www.example.com
- example.com
Depending on your server, you will find that the home page may also be displayed at:
- example.com/index.php (Linux servers)
- www.example.com/index.php (Linux servers)
- example.com/home.aspx (Windows servers)
- www.example.com/home.aspx (Windows servers)
This is the simplest and most visible case of duplicate content, and most people are aware of it.
This kind of duplication usually happens throughout a website, so if your site is rendering at www.example.com and example.com, it's probably also rendering at www.example.com/category and example.com/category. This means that duplicates are present throughout the site and have a significant impact on organic performance.
Solutions
- 301 (permanent) redirect (see the sketch below)
- Canonical link element
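A minimal sketch of the redirect, assuming an Apache server with mod_rewrite enabled (nginx and IIS have their own equivalents); the rules force every www request over to the non-www root domain:

RewriteEngine On
# Send any www.example.com request to example.com, preserving the path
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

Which version you standardize on – www or non-www – matters less than picking one and redirecting the other to it.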
Subfolders, subcategories, and child pages
Most websites use some form of categories and subcategories to help users find information. Categories are often the most important areas of an e-commerce site, as they intuitively target specific, refined search terms. For example, if I'm selling widgets on Widgets.com, and a potential customer wants to buy “blue widgets”, most often a category page for “blue widgets” will be returned as a result. The same goes for any site that categorizes content into subfolders and child pages.
Let's say my category structure is as follows:
example.com/category/sub-category
Here, the user probably navigated to the first category and then to one of its subcategories. Many systems allow this subcategory to be rendered at example.com/sub-category without including the parent category in the URL. This subcategory now renders the same content across multiple URLs; one that includes the parent category, and another that does not.
The same goes for child pages that can be rendered at example.com/category/product and example.com/product. This can happen on a non-commercial site like example.com/services/service-name and example.com/service-name.
Solutions
- 301 (permanent) redirect
- Canonical link element (see the sketch below)
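A minimal sketch of the canonical link element for this case, assuming the full-path version is the one you want to rank; it goes in the <head> of both URLs:

<!-- On example.com/sub-category AND example.com/category/sub-category -->
<link rel="canonical" href="http://example.com/category/sub-category">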
Pagination
In some cases, the content of a category page may be split across multiple pages – pages 1, 2, and 3, for example. We call this a "paged series".
Continuing the previous example, this is what page 1 would normally look like:
example.com/category
Page 2 will then be accessible at the following address: example.com/category/?p=2
How pagination is reflected in the URL depends on the site configuration. In this case, we are still in the same category, but on the second page. Search engines may well interpret the following pages as duplicates of page 1.
Solution:
- rel="next" and rel="prev" link elements (sketched below)
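A sketch of what the <head> of page 2 would contain in this example, assuming a page 3 exists:

<!-- On example.com/category/?p=2 -->
<link rel="prev" href="http://example.com/category">
<link rel="next" href="http://example.com/category/?p=3">

Page 1 only needs rel="next", and the last page of the series only needs rel="prev".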
URL parameters
Most websites add a parameter to a URL based on certain conditions, such as using a filter, a “sort by” feature, or a variety of other purposes. A common cause is the use of “breadcrumbs” that help users navigate a site. Breadcrumbs represent the path the user took to reach a specific page, and are usually clickable for navigation purposes.
Breadcrumbs are user-specific and are driven by session parameters that are sometimes visible in the page URL.
For example:
example.com/category/sub-category/product/?Path=312&214
Here, “Path” refers to the path taken by the user, and the numbers represent specific categories. In this example, the user navigated to category 312 and then to category 214. This can generate breadcrumbs that look like this:
home -> category -> subcategory -> product.
We are still on the same product page identified in the URL, but with URL parameters that create the breadcrumbs.
The same content is displayed on this page, but can be accessed using different URLs. This problem is exacerbated by the number of different routes a user can take, which greatly increases the number of duplicates.
Solution
- Canonical link element (see the sketch below)
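A minimal sketch: every parameterised variant carries a canonical link element pointing at the clean product URL, so it doesn't matter how many routes a user can take:

<!-- On example.com/category/sub-category/product/?Path=312&214
     (and any other Path combination) -->
<link rel="canonical" href="http://example.com/category/sub-category/product/">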
Capital letters and trailing slashes
Some platforms tend to ignore capitalization in URLs, allowing a page to be rendered regardless of case. If the same page is accessible from URLs that contain uppercase letters and from all-lowercase URLs, you're likely going to run into issues. For example:
- example.com/Category
- example.com/category
The same goes for trailing slashes (/) in URLs:
- example.com/category
- example.com/category/
Solutions
- 301 (permanent) redirect (see the sketch below)
- Canonical link element
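A sketch of the redirect half, again assuming Apache with mod_rewrite and standardising on lowercase, no-trailing-slash URLs:

RewriteEngine On
# Strip the trailing slash from anything that is not a real directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)/$ /$1 [R=301,L]
# Lowercase any URL containing capitals; the "lc" map must be declared
# in the main server config first: RewriteMap lc int:tolower
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^(.*)$ /${lc:$1} [R=301,L]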
Random CMS Junk
This is obviously not a technical term. Not all websites run on a modern, well-maintained CMS platform. Many are outdated, bespoke, and frankly not in good shape for SEO.
The quality of a bespoke CMS, for example, is directly linked to the knowledge and abilities of the development team that built it. A slight lack of technical SEO knowledge can result in a site that produces a large amount of duplicate dynamic content.
Finding this type of content is quite simple: do a site search in Google using "site:example.com" and look for indexed URLs containing "?", path parameters, or "index.php/?". Assuming your real URLs are SEO-friendly, any such URLs are most likely unmanaged duplicates of canonical pages.
Solution
- Canonical link element
Localization and translation
There are two ways to tailor content to an audience. Localization involves providing the content in the same language, but with the information adapted to each audience to account for regional differences. These variants can exist on a subdomain (us.example.com) or in a subfolder (example.com/us).
Where equivalent pages exist for another locale (such as uk.example.com or example.com/uk), the content should be localized for two reasons:
- to ensure the right content ranks for the right audience
- to ensure that similar content is not treated as a duplicate.
The same goes for translation, except that the difference is in the language. For example, fr.example.com or example.com/fr.
The important thing is that search engines don't perceive these pages as unmanaged duplicates, or as different pages; it's the same page, adapted for a different audience.
Solution
- Alternate (hreflang) link elements – I will address this point in detail in a future article, but a quick preview follows.
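A minimal sketch, assuming US, UK, and French variants in subfolders; the same set of elements goes in the <head> of every variant, each listing all versions including itself:

<link rel="alternate" hreflang="en-us" href="http://example.com/us/">
<link rel="alternate" hreflang="en-gb" href="http://example.com/uk/">
<link rel="alternate" hreflang="fr" href="http://example.com/fr/">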
Other cases of duplicate content
Duplicate content can come in other forms. Once you understand what it is, you can identify and fix duplication issues. Remember that “duplicate content occurs when the same page is rendered at multiple URLs”.
How to manage duplicate content?
First of all, duplicate content is not inherently a bad thing – almost all websites produce it. The problem arises when it is not managed using 301 redirects, robots directives, canonical link elements, or alternate link elements.
301 redirects (permanent)
Until the introduction of the canonical link element, 301 redirects were the best way to deal with duplicate content. However, redirects and canonical link elements work differently.
Once a 301 redirect is applied to a duplicate, the user can no longer access it and is redirected to the canonical (correct) version. The problem is that duplicates often exist precisely for the users. Continuing with the example of path parameters, breadcrumbs are genuinely useful for visitors; if URLs that include path parameters are redirected, breadcrumbs will no longer work properly, which hurts site navigation.
A 301 should therefore only be applied to duplicates that offer no added value to the user, such as the root domain and the www subdomain (www.example.com and example.com). By doing so, around 90% of the donor page's authority is passed to the redirect's target page, which consolidates your link equity.
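For instance, a sketch of collapsing the server's default document onto the root URL (Apache again assumed), since example.com/index.php offers the user nothing that example.com does not:

RewriteEngine On
# Only match index.php when the visitor actually requested it, to avoid
# looping if the server internally maps / to index.php
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php
RewriteRule ^index\.php$ http://example.com/ [R=301,L]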
Canonical link elements
The canonical link element treats duplicate content much like a 301 redirect, with one exception: users can still access the duplicate URL. This makes it the most effective way to manage duplicates without risking harm to the user experience.
A canonical link element looks like this:
<link rel="canonical" href="http://example.com">
It points to the canonical (correct) version of the web page it is on. The beauty of the canonical link element is that it can be applied site-wide, providing protection against duplicate content issues whether there is a problem or not.
The canonical version of the page must have a self-referencing canonical link element, that is, one that points to itself. All duplicates of that page then carry a canonical link element pointing to the canonical version.
Like a 301 redirect, the canonical link element passes around 90-95% of the link value to the target page. Canonical link elements also work across domains, so if for some reason your site is rendered on a second domain, canonical link elements can still point back to the original, avoiding duplication issues.
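A sketch of the cross-domain case, with a hypothetical second domain:

<!-- On http://other-domain.com/page, the stray copy of the site -->
<link rel="canonical" href="http://example.com/page">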
One last tip
There are a few nuances to getting the most out of the canonical link element, starting with choosing the canonical version. The version defined as canonical is the one that will be ranked in search engines, so you should use the version most likely to rank well.
For example, I may have a product page that renders at example.com/mens-shoes/black-shoes and also at example.com/black-shoes. If someone searches for "black shoes for men", which of these URLs is most likely to rank? When the category or subcategory contains important search terms, it may be worth defining the canonical version as the one that includes them in the URL.
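In that case, both versions of the page would carry the same canonical link element, pointing at the URL that includes the category:

<!-- On example.com/black-shoes AND example.com/mens-shoes/black-shoes -->
<link rel="canonical" href="http://example.com/mens-shoes/black-shoes">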
You may have noticed the appearance of “structured breadcrumbs” at some point in 2013, or maybe not. Traditionally, when a web page appears in the SERPs, the page URL appears below the page title.
With the right code in place, it is now possible to display the actual architecture of the site, based on breadcrumbs.
Referring to my previous example of categories, subcategories, and child pages: for these beautifully structured elements to show up, the canonical versions of the subcategories MUST include the parent categories in the URL, so that the canonical version carries the correct breadcrumb trail.
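For reference, breadcrumb markup is usually done with schema.org's BreadcrumbList vocabulary; a sketch in JSON-LD (microdata and RDFa work too), using the category example from earlier:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BreadcrumbList",
  "itemListElement": [
    { "@type": "ListItem", "position": 1, "name": "Home",
      "item": "http://example.com/" },
    { "@type": "ListItem", "position": 2, "name": "Category",
      "item": "http://example.com/category" },
    { "@type": "ListItem", "position": 3, "name": "Sub-category",
      "item": "http://example.com/category/sub-category" }
  ]
}
</script>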
Robots.txt
Neither duplicate content nor indexing should be handled using the robots.txt file. A disallow entry in robots.txt blocks crawling, not indexing, and it is very common for disallowed pages to end up indexed anyway when Googlebot or another crawler discovers links to them. Once a blocked page is indexed, it tends to stay in the index regardless of the contents of your robots.txt file, and the disallow also prevents crawlers from seeing the canonical link elements on the pages in question. Take a look at the example below:
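A sketch of the kind of entry people try (the paths here are hypothetical):

# robots.txt – blocks CRAWLING of these paths, but NOT indexing;
# the URLs can still be indexed if they are linked from elsewhere
User-agent: *
Disallow: /*?Path=
Disallow: /index.php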
If you insist on managing duplicate content by controlling indexing, it's better to use the "noindex" meta directive at the page level – a much more reliable approach. However, this will not pass link authority to canonical pages the way a canonical link element or 301 redirect would.
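A minimal sketch of the page-level directive; note that the page must remain crawlable (not disallowed in robots.txt) for the directive to be seen at all:

<!-- In the <head> of the duplicate you want dropped from the index -->
<meta name="robots" content="noindex">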
Well… any questions?
At 2400 words, there's still a lot I'd like to write on the subject, and maybe I will. If after reading this you still don't know what duplicate content is, feel free to ask for help in the comments below.