Omaha Media Group

Protect Your Ranking: Clean Your Site’s Cruft

We all have it. The cruft. The low-quality or even duplicate-content pages on our sites that we just haven't had time to find and clean up. It may seem harmless, but that cruft might be harming your entire site's ranking potential. By cruft I mean low-quality, thin, or duplicate-content pages that can cause issues even if they don't seem to be causing a problem today.


What is Cruft?

If you were to launch a large number of low-quality pages, pages that Google thought were of poor quality and that users didn't interact with, you could find yourself in a seriously bad situation, and for a number of reasons. Google certainly looks at content on a page-by-page basis, but it also considers things domain-wide.

So they might look at a domain and see lots of high-quality, high-performing pages with unique content, exactly what you want. But then they're going to see thin content pages with low engagement metrics that don't seem to perform well, and duplicate content pages that don't have proper canonicalization on them yet. This is what I'm calling cruft: thin content and duplicate content. Many variations fit inside those two categories.

One issue with cruft, for sure, is that it can cause Panda issues. Google's Panda algorithm is designed to look at a site and say, “You know what? You're tipping over the balance of what a high quality site looks like to us. We see too many low quality pages on the site, and therefore we're not just going to hurt the ranking ability of the low quality pages, we're going to hurt the whole site.” That is very problematic and really challenging, and many folks who've encountered Panda issues over time have seen this.

There are probably also other factors, not directly related to Panda, such as site-wide algorithmic analysis of engagement and quality. For example, there was a recent analysis of Google's Phantom II update, which hasn't really been formalized and which Google hasn't said anything about. One of the things that analysis looked at was the engagement of pages on the sites that got hurt versus the engagement of pages on the sites that benefited, and there was a clear pattern. Engagement on sites that benefited tended to be higher. On those that were hurt, it tended to be lower. So again, it may be not just Panda but other things that hurt you here.

Cruft can also waste crawl bandwidth. Especially on a large or complex site, if the engine has to crawl a bunch of cruft pages, that potentially means less crawl bandwidth and less frequent crawling of your good pages. It can hurt from a user perspective too: user happiness may be lowered, and that could mean a hit to your brand perception. It could also push down better-converting pages. Google isn't always perfect about this. It could see some of these duplicate, thin, or poorly performing pages and still rank them ahead of the page you wish ranked there, the high-quality one with good conversion and good engagement, and that hurts your conversion funnel. So there are all sorts of problems here, which is why we want to proactively clean out the cruft. This is part of the SEO auditing process.

Filter Your Cruft!

One way a lot of folks find cruft is with their analytics: Google Analytics, Omniture, Webtrends, whatever your system is. What you're trying to design there is a cruft filter. So get your handy dandy filter, keep all the good pages inside, and filter out the low quality ones.

You can build it in one of two ways: with an absolute threshold or with a relative one. In the first, you set a fixed threshold on bounce rate, time on site, pages per visit, or any other engagement metric that can serve as a filter. You basically say, “Hey, the threshold is anything with a bounce rate higher than 90%. I want my cruft filter to show me what's going on there.” Create that filter inside Google Analytics or Omniture. Then look at all the pages that match the criteria, work out what is wrong with each, and fix accordingly.
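As a rough illustration of the fixed-threshold approach, here is a minimal Python sketch. It assumes you have exported per-page metrics from your analytics tool into a list of records; the field names (`url`, `bounce_rate`, `sessions`) and the sample values are made up for the example and are not any particular tool's export format.

```python
# Hypothetical export: one record per page, as you might build from an analytics CSV.
pages = [
    {"url": "/pricing", "bounce_rate": 0.42, "sessions": 1800},
    {"url": "/tag/misc", "bounce_rate": 0.95, "sessions": 240},
    {"url": "/print/post-17", "bounce_rate": 0.93, "sessions": 310},
    {"url": "/blog/guide", "bounce_rate": 0.58, "sessions": 2200},
]

BOUNCE_THRESHOLD = 0.90  # the 90% cutoff used in the example above


def cruft_candidates(rows, threshold=BOUNCE_THRESHOLD):
    """Return pages whose bounce rate is strictly above the threshold, worst first."""
    flagged = [row for row in rows if row["bounce_rate"] > threshold]
    return sorted(flagged, key=lambda row: row["bounce_rate"], reverse=True)


for row in cruft_candidates(pages):
    print(f'{row["url"]}: {row["bounce_rate"]:.0%} bounce rate')
```

The output is only a list of candidates to review by hand, not a list of pages to delete.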

The second is relative to your own site. You basically say, “Hey, here's the average time on site, here's the median time on site, here's the average bounce rate, median bounce rate, average pages per visit, median, great. Now take me 50% below that, or one standard deviation below that, and show me all the pages that fall under it.”

This process is going to capture thin and low-quality pages. It is much less helpful for duplicate content, because a duplicate page is likely to perform very similarly to the page it duplicates.
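The relative version can be sketched the same way. The snippet below is an assumption-laden example, not a prescribed method: the metric (average time on page, in seconds) and the sample numbers are invented, and it uses the population standard deviation from Python's `statistics` module. For a metric where higher is worse, such as bounce rate, you would flip the comparison.

```python
from statistics import mean, pstdev

# Hypothetical per-page average time on page, in seconds.
time_on_page = {
    "/blog/guide": 210,
    "/pricing": 150,
    "/about": 120,
    "/tag/misc": 20,
    "/blog/how-to": 180,
    "/archive/2009": 40,
}


def below_half_of_average(metrics):
    """Pages whose value is less than 50% of the site-wide average."""
    cutoff = 0.5 * mean(metrics.values())
    return sorted(url for url, value in metrics.items() if value < cutoff)


def below_one_std_dev(metrics):
    """Pages more than one (population) standard deviation below the average."""
    values = list(metrics.values())
    cutoff = mean(values) - pstdev(values)
    return sorted(url for url, value in metrics.items() if value < cutoff)


print(below_half_of_average(time_on_page))
print(below_one_std_dev(time_on_page))
```

On a real site the two cutoffs will often flag different sets of pages, so it can be worth running both and comparing.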

Sort Your Cruft!

In this case, you've got a cruft sorter: filters based on things you can identify in the URL string, in matching title elements, or in matching content. One example is a duplicate content filter. Most crawling tools that offer one ship with a default setting, and some of them let you change it. (Google Webmaster Tools, now Search Console, is not one of them: its filters aren't very malleable.)

You may say, “Hey, identify anything that's more than 80% duplicate content.” Or, if I know that I have a site with a lot of pages that have only a few images and a little bit of text, but a lot of navigation and HTML on them, maybe I'd turn that up to 90% or even 95%, depending on the site.
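To make the percentage idea concrete, here is a small sketch that compares page text pairwise and reports pairs at or above a threshold. It is only one way to measure similarity: it uses `difflib.SequenceMatcher` from the Python standard library on word lists, which is an assumption of this example and not how any particular crawler computes its duplicate score. The page texts are invented, and a real run would first strip navigation and other boilerplate.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical extracted body text, keyed by URL.
pages = {
    "/widgets": "Our blue widget ships free and comes with a two year warranty",
    "/widgets?refer=partner": "Our blue widget ships free and comes with a two year warranty",
    "/widgets/print": "Our blue widget ships free and comes with a one year warranty",
    "/about": "We are a small team that has built widgets since 2009",
}


def similarity(text_a, text_b):
    """Word-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, text_a.lower().split(), text_b.lower().split()).ratio()


def near_duplicates(docs, threshold=0.80):
    """Return (url_a, url_b, score) for every pair at or above the threshold."""
    pairs = []
    for (url_a, text_a), (url_b, text_b) in combinations(docs.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((url_a, url_b, round(score, 2)))
    return pairs


for url_a, url_b, score in near_duplicates(pages, threshold=0.80):
    print(f"{score:.2f}  {url_a}  <->  {url_b}")
```

Comparing every pair gets slow on large sites, so treat this as a way to see how the threshold behaves on a sample rather than as a full-site crawler.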

You can also use rules to identify known duplicate content violators. For example, say I've identified that every URL containing something like ?refer=bounce, or a partner parameter, is a duplicate. Okay, now I just need to filter for that particular URL string. Or I could look at titles: if I know that one of my pages, or a certain type of page, has been heavily duplicated throughout the site, I can look for all the titles containing that text and then filter out the dupes.
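Both rules are easy to express in code. The sketch below assumes crawl output as simple (URL, title) pairs; the parameter names `refer` and `partner` follow the example above, while the URLs and titles are placeholders.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit

# Hypothetical crawl output: (url, title) pairs.
crawl = [
    ("https://example.com/widgets", "Blue Widgets | Example"),
    ("https://example.com/widgets?refer=bounce", "Blue Widgets | Example"),
    ("https://example.com/widgets?partner=acme", "Blue Widgets | Example"),
    ("https://example.com/about", "About Us | Example"),
]

# Assumed list of query parameters known to create duplicate URLs on this site.
DUPLICATE_PARAMS = {"refer", "partner"}


def has_duplicate_param(url, params=DUPLICATE_PARAMS):
    """True if the URL's query string contains any known duplicate-creating parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(name in params for name in query)


def titles_with_multiple_urls(rows):
    """Group URLs by normalized title and keep only titles used more than once."""
    by_title = defaultdict(list)
    for url, title in rows:
        by_title[title.strip().lower()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}


print([url for url, _ in crawl if has_duplicate_param(url)])
print(titles_with_multiple_urls(crawl))
```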

I can also do this for content length. Many folks will look at content length and say, “Hey, if there's a page with fewer than 50 unique words on it in my blog, show that to me. I want to figure out why that is, and then I might want to do some work on those pages.”
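A content-length check like that might look as follows. This is a sketch under a few assumptions: the page text has already been extracted from the HTML, "word" means a run of letters, digits, or apostrophes, and the sample pages are invented.

```python
import re

MIN_UNIQUE_WORDS = 50  # the cutoff from the example above

# Hypothetical extracted body text, keyed by URL.
pages = {
    "/blog/guide": " ".join(f"term{i}" for i in range(60)),  # stand-in for a long post
    "/blog/placeholder": "Coming soon. Check back soon.",
}


def unique_word_count(text):
    """Count distinct words, ignoring case and punctuation."""
    return len(set(re.findall(r"[a-z0-9']+", text.lower())))


def thin_pages(docs, minimum=MIN_UNIQUE_WORDS):
    """Return URLs whose text has fewer unique words than the minimum."""
    return [url for url, text in docs.items() if unique_word_count(text) < minimum]


print(thin_pages(pages))
```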

Ask SERP Providers (Cautiously)

The last option for this identification process is Google and Bing Webmaster Tools/Search Console. They have existing filters and features that aren't very malleable. We can't do a whole lot with them, but they will show you potential site crawl issues, broken pages, and sometimes dupe content. They're not going to catch everything though. Part of this process is to proactively find things before Google and Bing find them and start considering them a problem on our site. So we may want to do some of this work before we go, “Oh, let's just shove an XML sitemap to Google and let them crawl everything, and then they'll tell us what's broken.” A little risky.

Additional Tips, Tricks & Robots

A couple of additional tips. Analytics stats, like the ones from Google Analytics or Webtrends, can totally mislead you, especially for pages with very few visits, where you just don't have enough of a sample to know how they're performing, or for pages that the engines haven't indexed yet. If something hasn't been indexed or just isn't getting search traffic, its metrics may give you a misleading picture of how users engage with it and bias you in ways you don't want. So be aware of that. You can generally control for it by looking at other stats or by using the other methods described here.
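One simple way to control for small samples is to set low-traffic pages aside before applying any engagement filter. The sketch below assumes a `sessions` count per page and an arbitrary minimum of 100 sessions; both the field name and the cutoff are illustrative choices, not a recommendation.

```python
MIN_SESSIONS = 100  # assumed cutoff; pick one that suits your traffic levels

# Hypothetical per-page metrics.
pages = [
    {"url": "/blog/guide", "sessions": 2200, "bounce_rate": 0.58},
    {"url": "/blog/new-post", "sessions": 12, "bounce_rate": 1.00},
    {"url": "/tag/misc", "sessions": 240, "bounce_rate": 0.95},
]


def split_by_sample_size(rows, min_sessions=MIN_SESSIONS):
    """Separate pages with enough visits to judge from those without."""
    enough, too_few = [], []
    for row in rows:
        target = enough if row["sessions"] >= min_sessions else too_few
        target.append(row["url"])
    return enough, too_few


enough, too_few = split_by_sample_size(pages)
print("Judge on engagement:", enough)
print("Check another way:", too_few)
```

Here /blog/new-post has a 100% bounce rate, but with only 12 sessions that number says very little, so it goes to the review pile instead of the cruft list.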

Any time you identify cruft, the first thing you should do is remove it from your XML sitemaps. That's just good hygiene, good practice, and oftentimes it is enough on its own to prevent at least some of the harm. Beyond keeping it out of your XML sitemap, however, there's no one-size-fits-all methodology. If it's a duplicate, you want to canonicalize it. Maybe I don't want to delete all these pages. Maybe I want to delete some of them, but I need to be deliberate about that. Maybe they're printer-friendly pages. Maybe they're pages in a specific format, such as a PDF version instead of an HTML version. Whatever it is, you want to identify those and probably canonicalize them.
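Removing flagged URLs from a sitemap can be scripted. The sketch below uses Python's standard `xml.etree.ElementTree` and the standard sitemap namespace; the sample sitemap and the cruft URL are made up, and a real script would read and write your actual sitemap file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", SITEMAP_NS)  # keep the default namespace on output

# Hypothetical sitemap content.
SAMPLE_SITEMAP = (
    f'<urlset xmlns="{SITEMAP_NS}">'
    "<url><loc>https://example.com/blog/guide</loc></url>"
    "<url><loc>https://example.com/print/post-17</loc></url>"
    "<url><loc>https://example.com/pricing</loc></url>"
    "</urlset>"
)


def remove_from_sitemap(sitemap_xml, cruft_urls):
    """Return the sitemap XML with every <url> whose <loc> is in cruft_urls removed."""
    root = ET.fromstring(sitemap_xml)
    for url_el in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url_el.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        if loc in cruft_urls:
            root.remove(url_el)
    return ET.tostring(root, encoding="unicode")


cleaned = remove_from_sitemap(SAMPLE_SITEMAP, {"https://example.com/print/post-17"})
print(cleaned)
```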

Is it useful to no one? Like literally, absolutely no one? You don't want engines visiting it. You don't want people visiting it. There's no channel that you care about sending traffic to that page. Then redirect it with a 301 to a relevant page that you do want to perform, or remove it and let it return a 404 or 410.

Is it useful to some visitors, but not to search engines? You don't want searchers to find it in the engines, but if somebody is paging through a bunch of pages on the site and lands on it, okay, great. For that you can use noindex, follow in the meta robots tag of the page: <meta name="robots" content="noindex, follow">.

If there's no reason bots should access it at all, meaning you don't even care about them following the links on it, you can use the robots.txt file to block crawlers from visiting it. This is a very rare use case, but there can be certain types of internal content that you don't want bots even trying to access, like a huge internal file system that particular kinds of visitors might need but nobody else. Just be aware that a page can still get into the engines if it's blocked in robots.txt. It just won't show any description. The result will say something like, “We are not showing a site description for this page because it's blocked by robots.”
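Before relying on a robots.txt rule, it can help to check that it blocks what you expect and nothing more. The sketch below uses `urllib.robotparser` from the Python standard library; the `/internal-files/` path and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks all crawlers from one internal directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal-files/
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/internal-files/report.pdf"))
print(is_crawlable(ROBOTS_TXT, "https://example.com/blog/guide"))
```

This only tells you whether a compliant crawler may fetch the URL. As noted above, a blocked URL can still appear in results without a description.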

With this process, hopefully you can prevent yourself from getting hit by the potential penalties, or being algorithmically filtered, or just being identified as not that great a website. You want Google to consider your site as high quality as they possibly can. You want the same for your visitors, and this process can really help you do that.
