When working with onpage SEO you need to write unique content to keep googlebot happy. Especially if you operate a web page that has a lot of verticals and regions and each combination have it’s own landingpage. Writing really unique content in scale is very difficult and with really unique content I mean never using the same sentence twice.

We use Excel for writing content for landingpages. The problem that I wanted to solve is that given two texts (let say containing 40-60 words each) what is the longest consecutive string that they share? This is know as the Longest common subsequence problem in computer science.

For example the two strings “Hello, my name is Niels” and “Hi! my name is John”  has a LCS ” my name is” .

I can use this as a metric saying that given texts for two different landingpages, the LCS may not be more than X words long.

To use this in our daily operation I’ve started a working on a small project that I’ve named SeoTools.

SeoTools is a small Excel plugin that when opened, the current spreadsheet gets a bunch extra functions. Here’s a guide how to find duplicate content using SeoTools:

  1. Download SeoTools.
  2. Unzip to a directory of your choice.
  3. Now open a Excel spreadsheet that contains the column with the texts you want to find duplicate content in.
  4. Then open the SeoTools.xll (Opening XLL plugins in Excel “augments” the current session).
  5. You now get a security message. Click the button to the left that says “Activate the plugin only in this session”. (translated from Swedish)
  6. Insert a column next to the column with the texts that you want to get LCS for.
  7. Use a formula like =FindDuplicateContent(A1;$A$1:$A$3) (Tip: Use F4 to make the absolute cell references of the vector.)
  8. Result:
  9. Note that LCS is a somewhat computation heavy operation and that for each use of FindDuplicateContent, LCS is calculated for each string in the vector. So don’t expect immediate results for large amount of text. You might want to set your Excel spreadsheet to calculate formulas manually.

Implementation

The LCS algorithm used is 98% based on a C# implementation I found in Wikibooks.

Tagged with:
 

6 Responses to Excel plugin for finding duplicate content in cells

  1. Mitko says:

    Hello,

    I tried to use your SEO tool for Excel – the Google Ranking, but it gives me a -1 result for any string,

    I use it like this =GooglePageRank(A1)

    It works fine for all other functions, any ideas?

    Thanks

  2. Gerry White says:

    Awesome toolset, but my approach to getting duplicate content is slightly different

    1. Highlight the row,
    2. Conditional Formatting
    3. Highlight Cell Rules – Duplicate Values
    4. You cant then filter and sort by duplicates

  3. nielsbosma says:

    Glad you liked it!

    That approach won’t find you duplicate content that are substrings if you’re working with longer strings of text.

  4. Yann says:

    This is great !
    Could you add a function to count the number of links on the page, and maybe separate the internal links (same domain), with external links (to other domains), and check if nofollowed ?

    Thanks !

  5. Yann says:

    sorry, my previous comment was meant for this page:
    http://nielsbosma.se/projects/seotools/

  6. nielsbosma says:

    Sure, I’ll add it to my todo list!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>