The Definitive Guide to Blog Content Scraping & How to Stop It!

    Anyone who blogs great content, or even not-so-great content, knows that their posts are going to end up on numerous other blogs around the Internet within days, sometimes minutes, of when they’re published.

    Anyone can easily install WordPress, skin it with a custom theme, and grab a couple plugins that will go out and “scrape” content from other blogs to publish on their own blog.

    You CAN greatly reduce the content scraping of your blog!

    There are a number of ways to approach content scraping of your blog posts, from doing nothing to blocking scrapers IP by IP via .htaccess (not recommended).

    Doing nothing about content scrapers — the easiest approach!

    Yes, some, like Chris Coyier of CSS-Tricks, recommend doing nothing.

    Chris says that instead of going to war with the scrapers, “you could spend that time doing something enjoyable, productive, and ultimately more valuable for the long-term success of your site.”

    He goes on to list the reasons your content will always do better in searches than the hijacked content, i.e., because your blog:

    • is on a domain with more trust;
    • published that article first;
    • is coded better for SEO than theirs;
    • is better designed than theirs;
    • isn’t at risk for serious penalization from search engines.

    All of the above may be true, but I found that in some instances sites that had copied my articles were ranking higher than my original content! Although I think this may have been a temporary glitch due to Google’s February 2011 “Panda” update (a primary goal of which was to improve detection of scraper sites, but which hit a lot of quality sites!), it still motivated me to do whatever I could to eliminate the copying and duplicate publishing of my blog posts.

    Check out Jeff Starr’s rebuttal of the do-nothing approach.

    Proactive Approaches to Dealing with Blog Content Scraping

    If, like me, you want to try to greatly reduce the incidence of your posts being copied wholesale onto other blogs, there are ways that will definitely help you get this done.

    1. Ping Google & Other Search Engines and RSS Feed Sites when you Publish
      This notifies the search engines and RSS site-updating services of your post, ensuring your content gets indexed first.

      WordPress has a built-in way to list the sites you want to ping when you publish your post. In your WordPress admin, navigate to:

      Settings > Writing > Update Services (near the bottom)

      Just enter the URLs for the services to ping.

      The WordPress Codex offers a list of site-update services to ping. This site has a much more comprehensive list. (If you use this list, be sure to remove the spaces before the URLs and have just single line breaks between each URL.)
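
      WordPress expects that box to contain bare service URLs, one per line, with no leading spaces. As a small illustrative sketch (not part of WordPress itself), here is one way to normalize a pasted list before dropping it into the Update Services box; the two service URLs shown are just examples:

```python
def clean_ping_list(raw: str) -> str:
    """Strip leading/trailing spaces, drop blank lines and
    duplicates, and return one URL per line."""
    seen = set()
    urls = []
    for line in raw.splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return "\n".join(urls)

# A messy pasted list becomes a clean one-URL-per-line block.
messy = """
  http://rpc.pingomatic.com/

  http://rpc.pingomatic.com/
  http://blogsearch.google.com/ping/RPC2
"""
print(clean_ping_list(messy))
```

      Paste the cleaned output into Settings > Writing > Update Services and WordPress will ping each service every time you publish.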

    2. Try to contact the offending blog owner
      Sometimes the scraping is done by a human who honestly doesn’t think there’s anything wrong with copying your post. I’ve had two situations where I’ve contacted the owner and asked, civilly but forcefully, to remove my copied content, and they removed it!

      Unfortunately, most of the content scraping is done by bots and there’s no one actually “manning” the blog. So this approach is not really the best, unless there’s a human repeat offender, and sometimes they’ll just ignore you.

    3. Include links to other posts on your blog
      In a YouTube video, Matt Cutts of Google’s Search Quality group says that “If you make sure that the pages on your site link to you … then if someone scrapes you they might end up linking to you. To the extent that that’s a successful scraper or a successful spammer, those links will help you along.”

      Of course, more sophisticated scrapers or scraping software will probably remove all your links. Looking on the bright side, this is probably the only benefit of having your content scraped.

    4. Publish only a “summary” feed of your post instead of the full post
      Content scrapers very often grab your post content from your RSS feed. One way to curtail this is to publish only a summary in your feed; the reader then clicks the hyperlinked article title to read the entire article on your blog.

      If you use WordPress, go to Settings > Reading and set “For each article in a feed, show” to “Summary”.

      At least this way, your blog content won’t be delivered wholesale right to the scraper’s doorstep!

      Opinions vary on RSS Feed Summary vs. Full Article
      Blogger Kristi Hines points out that she prefers to publish the full RSS feed because some readers don’t like having to leave their RSS reader.

      Kimberly Castleberry points out that there are ways in RSS Readers for users to force the entire article rather than a summary.

      However, I would prefer that users click over to our blog to read our posts, primarily because it provides the ability to comment, to share using the Like, Tweet, Google +1 and LinkedIn buttons, and possibly to view other content on our website or blog.

      Because we tend to post articles that provide valuable information, I’m confident that our readers will be willing to click outside the comfort of their RSS reader.
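
      For the curious, the Summary setting behaves roughly like the sketch below: strip the markup and keep only the first 55 words (WordPress’s default excerpt length). This is only an illustration of the idea, not WordPress’s actual code:

```python
import re

def make_summary(html: str, word_limit: int = 55) -> str:
    """Strip HTML tags and truncate to the first `word_limit`
    words, mimicking a feed-summary excerpt."""
    text = re.sub(r"<[^>]+>", " ", html)  # replace tags with spaces
    words = text.split()
    if len(words) <= word_limit:
        return " ".join(words)
    return " ".join(words[:word_limit]) + " [...]"

# A 100-word post is cut down to a 55-word teaser.
post = "<p>" + " ".join(["word"] * 100) + "</p>"
print(make_summary(post))
```

      A scraper pulling your feed then gets only the teaser, while the headline link still leads readers to the full article on your blog.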

    5. Disable Pingbacks and Trackbacks on your Blog.
      Pingbacks and Trackbacks notify you that other blogs have linked to your article.

      Although it’s good to be informed when this happens, it certainly isn’t necessary for your readers to see these backlinks, and with many blog themes they are displayed after your post. In the dashboard there is an incoming links area where you can see backlinks to your posts. And you can also view backlinks via Google Webmaster Tools.

      Unfortunately, verification of Trackbacks isn’t reliable and they have been exploited by the Parasite Class to create backlinks to their websites with no value to yours.

      Worse, Google might crawl your article, see a backlink to the spammy site, visit the spammy site and see no link to your article, and determine that they’re the original author!

      Note that when you disable Pingbacks and Trackbacks on your blog, it only affects subsequent posts, NOT existing posts! Syed Balkhi has written a tutorial on disabling Pingbacks and Trackbacks on existing posts.

      Disabling Pingbacks and Trackbacks doesn’t prevent the Parasite Class from stealing your content, but at least it removes the ability to also create a backlink to their content on your blog!
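
      Syed’s fix for existing posts comes down to a single SQL UPDATE on the wp_posts table. The sketch below runs that UPDATE against an in-memory SQLite table standing in for wp_posts (the real table lives in your site’s MySQL database; back up your database before running anything like this for real):

```python
import sqlite3

# Toy stand-in for WordPress's wp_posts table (real columns:
# post_status and ping_status exist in the actual schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wp_posts (ID INTEGER, post_status TEXT, ping_status TEXT)")
conn.executemany(
    "INSERT INTO wp_posts VALUES (?, ?, ?)",
    [(1, "publish", "open"), (2, "publish", "open"), (3, "draft", "open")],
)

# The one-line fix: close pings on all already-published posts.
conn.execute("UPDATE wp_posts SET ping_status = 'closed' WHERE post_status = 'publish'")

rows = conn.execute("SELECT ID, ping_status FROM wp_posts ORDER BY ID").fetchall()
print(rows)  # published posts are 'closed'; the draft stays 'open'
```

      The Settings > Discussion checkboxes then keep pings closed on everything you publish from that point on.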

    6. Add an RSS Footer to your RSS Feeds.
      Syed Balkhi of WPBeginner.com recommends adding a line of text, called an “RSS Footer,” at the end of each post in your RSS feed. You can do this with the Yoast SEO WordPress plugin, which lets you add specific text at the end of all your posts when viewed in RSS feeds, or you can code it yourself using Syed’s tutorial.

      Syed recommends adding a link back to your original article and blog, which lets Google know that you’re the original source of the article.
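
      The idea behind an RSS footer fits in a few lines. The function below is an illustration of the technique, not the plugin’s actual code, and the URLs and names are made up: it simply appends an attribution paragraph, with a link back to the original article, to each post before it goes into the feed.

```python
def add_rss_footer(content: str, post_url: str, post_title: str, blog_name: str) -> str:
    """Append a source-attribution paragraph, linking back to the
    original article, to a post's feed content."""
    footer = '<p><a href="{url}">{title}</a> originally appeared on {blog}.</p>'.format(
        url=post_url, title=post_title, blog=blog_name
    )
    return content + footer

item = add_rss_footer(
    "<p>Post body...</p>",
    "http://example.com/blog/my-post/",
    "My Post",
    "Example Blog",
)
print(item)
```

      If a scraper republishes your feed verbatim, the stolen copy now carries a backlink that points Google (and readers) to the original.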

    7. File a Digital Millennium Copyright Act (DMCA) complaint
      Google’s Matt Cutts also recommends filing a DMCA complaint.

      You can also request that Google remove content from its index.

    And, finally, there’s….

    Google’s Great New Method to Assess Content Authorship — the rel=”author” Attribute!

    Google, of course, wants to do whatever it can to minimize or eliminate content scraping, as it pollutes their search index with duplicate content. This was probably the primary driver behind Google’s late-February 2011 “Panda” update.

    In June 2011, Google announced a new way to ascertain original authorship: the HTML attribute rel=”author”. (NOTE: You need to have a Google profile for this method to work.)

    Preferred Method: Adding the rel=”author” attribute with HTML or by Modifying your WordPress CMS

    Step 1: Add rel=”author” to your author-page link. Every blog post has an author credit and the author’s name is usually hyperlinked to an author bio.

    If your blog has multiple authors each with their own author bio page, or you’re the only author and have an author bio page, you can modify the file “single.php” (assuming your theme has this file). Look for this bit of code: get_author_posts_url(get_the_author_ID()).

    For our blog, which uses the Fusion theme, our modified file looked like this:

    <?php printf(
        __( 'by %s on', 'fusion' ),
        '<a href="' . get_author_posts_url( get_the_author_ID() ) . '" rel="author" title="'
            . sprintf( __( 'Posts by %s', 'fusion' ), attribute_escape( get_the_author() ) )
            . '">' . get_the_author() . '</a>'
    ); ?>

    The above adds the rel=”author” attribute to all links to author pages.

    If this is too technical or you can’t modify the links to the author pages, then just have each author add a link to his/her author page at the bottom of each post, making sure to include the rel=”author” attribute:

    <a href="http://www.myDomain.com/blog/author/MyName/" title="My Author Page" rel="author">My Author Bio</a>

    Step 2: Add link to Google profile from author page. On each author page, add a link to the author’s Google profile (assuming they have one), and include the rel=”me” attribute.

    IMPORTANT: WordPress, by default, doesn’t allow authors to add “rel” attributes to links in their bio page, but, if you seriously trust all your authors, here’s a WordPress plugin that enables the “rel” attribute in author bios.

    The link to your Google profile will look something like this:

    <a href="https://plus.google.com/xxxxxxxxxxxxxxxxx/posts" rel="me">My Google+ Profile</a>

    or

    <a href="https://profiles.google.com/[your Google ID]/about" rel="me">My Google Profile</a>

    Alternate Method: If you are unable to add the rel=”author” attribute and/or don’t have an author bio page

    Matt Cutts and Othar Hansson provide a solution for those who can’t add the rel=”author” attribute or modify their WordPress CMS (video).

    On each post, just put a link to your Google profile and add the parameter ?rel=author to the URL:

    <a href="https://plus.google.com/xxxxxxxxxxxxxxxxx/posts?rel=author">My Google Profile+</a>

    Once you have done one of the above, you must complete the authorship loop by linking from your Google profile to your blog. IMPORTANT: Make sure you add the “+” to your anchor text, as above!

    Step 3: Completing the Authorship Loop – Link to your Blog from your Google or Google+ profile. First, log in to your Google account:

    After logging in you’ll see your name on the right side of the black bar. Click it and then click the “Profile” link. This takes you to your profile.


    If you have a Google+ account, then on your profile page, you click the blue “Edit Profile” button:

    Google+ - Edit Profile

    The red bar will appear at the top, indicating you are in Edit mode. You will see this on the right side of your profile:

    Google+ Contributor to Links

    Click where the links are, next to the globe icon. If there are no links yet, you will see “Contributor to”.

    Click the globe or “Contributor to” to edit.

    Google+ Contributor to Links

    Click “Add custom link.” For each link, there’s a field for the link name and one for the URL. When you’ve added your blog link, click “Save”.

    If you don’t have a Google+ account
    It’s pretty much the same drill. Edit “Contributor to” links to add a link to your blog.

    This completes the authorship loop!

    How to see who’s scraping your blog content

    There are several ways to find out who’s scraping your blog content. Rajasekharan N. of MT Herald has an excellent article specifying four ways to do this:

    1. AdSense Allowed Sites
      Google AdSense’s Allowed Sites feature allows you to specify the sites or URLs on which you wish to have your Google ads displayed. Google: “If a URL displaying your AdSense ad code is not on your Allowed Sites list, ads will still be displayed, but impressions and clicks won’t be recorded, advertisers won’t be charged, and you won’t receive any earnings for that URL.”
    2. FeedBurner Uncommon Uses
      FeedBurner, Google’s free Web-feed management service, flags uncommon uses of your feed. After logging in to your FeedBurner account, go to the Analyze tab: Feed Stats > Uncommon Uses.

      Read more about how to use the Uncommon Uses info on Google’s FeedBurner Help site.

    3. Google Webmaster Tools, Links To Your Site
      Google Webmaster Tools reports the sites linking to your posts. You should check these every so often for suspicious linking patterns.
    4. Search Google for a Specific Phrase
      Do a Google search for a specific and unique phrase from your blog article, surrounded by quotes so that Google returns only exact matches.

      For example, from this article, I might search for this phrase:

      “I’ve had two situations where I’ve contacted the owner and asked, civilly but forcefully, to remove my copied content, and they removed it!”

      Any sites returned by this search will most likely have copied my post, as the likelihood of another author independently using that exact phrase is extremely low.
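
      If you want to script this check, the exact-match search URL is easy to build. The helper below is just an illustration; wrapping the phrase in quotes is what tells Google to return only pages containing that exact string.

```python
from urllib.parse import quote_plus

def exact_match_search_url(phrase: str) -> str:
    """Build a Google search URL for an exact-match (quoted) phrase."""
    return "https://www.google.com/search?q=" + quote_plus('"%s"' % phrase)

url = exact_match_search_url("civilly but forcefully, to remove my copied content")
print(url)
```

      Open the resulting URL in a browser and every hit other than your own blog is a candidate scraper.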

    What Do YOU Think of Content Scraping?

    As I said, you probably won’t be able to halt content scraping altogether, but you can at least minimize it AND try to get some backlinks from the Parasite Class.

    Let me know in the Comments how you’ve dealt with this issue.

    Comments

    1. I followed all these steps and my problem is that the offending site is still able to procure my entire posts, even with summaries selected for my site and the feed. It seems combating scrapers will be a constant as long as you write/blog on the Internet. Thanks for the helpful post!

      • Yes, the methods described here are most effective against the auto-scraping apps out there.
        If someone, a human, really wants to duplicate your post on their blog, they just have to View Source and copy the code and paste it into their own post.

        • Yeah. This is unfortunate, Tim. Especially when you take the time and get permissions to use images, tools, etc., only for a content scraper to just copy and paste the post. Thanks for the response. I did not think of ‘view source.’ Seems you can’t win for losing. Again, thanks!

    2. James Colin says:

      I prefer the first approach: do nothing, don’t waste one second on copiers, your time on earth is so valuable that not one second should be wasted on people who don’t deserve it.

    3. I sometimes include sections of other blogs that I think are great on my blog. It’s (a) always attributed and (b) I always include it in a flow of conversation that I have in my own blog. I always thought this was a good thing, but now I’m confused as to the difference between doing that and “content scraping.” Any clarification you can provide?

    4. Kimberly Castleberry says:

      Unfortunately, Kristi is right, not only from the reader standpoint about full-post RSS, but also in that there are now dozens of tools that will make your site hand over the full feed. I have a blog post coming up on one. I use them to “pop” blogs that I want to read in my reader for all the right reasons, but of course thieves use them for the wrong reasons. So using a summary gains us little when it comes to scrapers but hurts the reader, which is unfortunate on both counts.

    5. Yup, that’s certainly one way to regard it, James. But if the scraped content starts to outrank your original content because the scraper site has more backlinks, or is an older domain, etc., then you may be motivated to take action.

    6. No, that approach is fine, as long as the content of your post is mostly original, and you provide attribution and a backlink to the author you’re quoting.

      Content scraping is wholesale copying of your entire post into another site’s blog.

      There are sites out there that just scrape the first paragraph or two of posts, and then have a link to the original post, with anchor text along the lines of “Read the whole article at….”.

      This too is okay, in my book.
      It’s really a matter of extent and intent.

    7. Yes, Kimberly, it’s sort of an arms race. Thanks for the insight!

    8. Dori Harpaz says:

      Hi All,
      I work at a web security company called Incapsula. We are soon launching a unique anti-scraping feature we developed, which should be able to block all kinds of automated scraping. Anyone interested in adding his website to our BETA, please send me an email to: dori@incapsula.com

    9. Hi Tim – I just had a chance to read this post. Thanks for putting something so thorough together.

      I think Google’s Panda is still having adverse effects on some bloggers and journalists.  Here’s a post about “How to recover from Panda” that your readers might find interesting, http://go.distil.it/PXPOpB (note: this is not my blog). 

      BTW, have you ever tried to quantify the cost bloggers incur from web scrapers stealing their content?  (brand, traffic, ad revenue, wasted resources, etc…)?  

      Thanks again,
      Sean Harmer

    10. Rajasekharan says:

      Thanks Tim for mentioning my blog post at MT Herald. I noticed your article only just now while tracking the incoming links to my blog. Superb one. You have covered almost all the methods to plug the leaks.
