When working with on-page SEO, you need to write unique content to keep Googlebot happy, especially if you operate a site with a lot of verticals and regions where each combination has its own landing page. Writing truly unique content at scale is very difficult, and by truly unique I mean never using the same sentence twice.
We use Excel for writing content for landing pages. The problem that I wanted to solve is this: given two texts (let’s say containing 40-60 words each), what is the longest consecutive string that they share? In computer science this is known as the longest common substring (LCS) problem. It is easy to confuse with the longest common subsequence problem, which also allows gaps between the matching parts.
For example, the two strings “Hello, my name is Niels” and “Hi! my name is John” have the LCS “ my name is”.
I can use this as a metric: given the texts for two different landing pages, the LCS may not be more than X words long.
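To make the metric concrete, here is a minimal Python sketch of the idea. It is not the code SeoTools uses; the function names `longest_common_run` and `too_similar` and the parameter `max_words` are made up for illustration. It also assumes the texts are compared word by word (split on whitespace) rather than character by character, so the result carries no leading space.

```python
def longest_common_run(text_a, text_b):
    """Return the longest run of consecutive words that both texts share."""
    a, b = text_a.split(), text_b.split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one word earlier in both texts.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def too_similar(text_a, text_b, max_words):
    """True if the shared run is longer than the allowed number of words."""
    return len(longest_common_run(text_a, text_b)) > max_words


shared = longest_common_run("Hello, my name is Niels", "Hi! my name is John")
print(" ".join(shared))  # my name is
```

With a limit of two words, the example pair above would be flagged, since the shared run is three words long.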
To use this in our daily operations, I’ve started working on a small project that I’ve named SeoTools.
SeoTools is a small Excel plugin that, when opened, adds a bunch of extra functions to the current spreadsheet. Here’s a guide to finding duplicate content using SeoTools:
- Download SeoTools.
- Unzip to a directory of your choice.
- Now open an Excel spreadsheet that contains the column with the texts you want to check for duplicate content.
- Then open SeoTools.xll (opening an XLL plugin in Excel “augments” the current session).
- You now get a security message. Click the button to the left that says “Activate the plugin only in this session”. (translated from Swedish)
- Insert a column next to the column with the texts that you want to get LCS for.
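- Use a formula like =FindDuplicateContent(A1;$A$1:$A$3) (Tip: press F4 to make the cell references of the range absolute. Depending on your regional settings, Excel may expect a comma instead of a semicolon between the arguments.)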
- Note that LCS is a somewhat computation-heavy operation, and that each use of FindDuplicateContent calculates the LCS against every string in the range. So don’t expect immediate results for large amounts of text. You might want to set your Excel spreadsheet to calculate formulas manually.
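To illustrate why the formula gets slow, here is a rough Python sketch of what a function like this has to do for each cell. It is a guess at the behaviour, not the plugin’s actual code: `find_duplicate_content` and `longest_shared_words` are hypothetical names, the texts are compared word by word, the sample texts are invented, and I assume the cell’s own text is skipped once when it appears in the range.

```python
from difflib import SequenceMatcher


def longest_shared_words(text_a, text_b):
    """Longest run of consecutive words found in both texts."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent items, so common words still count towards a match.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def find_duplicate_content(text, texts):
    """Hypothetical stand-in for the Excel function: compare one text
    against every text in the range and return the longest shared run."""
    best = []
    skipped_self = False
    for other in texts:
        # Assumption: the cell itself is part of the range, so the first
        # identical entry is skipped instead of being reported.
        if other == text and not skipped_self:
            skipped_self = True
            continue
        shared = longest_shared_words(text, other)
        if len(shared) > len(best):
            best = shared
    return " ".join(best)


# Invented sample texts standing in for column A.
column = [
    "Cheap hotels in Stockholm with free breakfast",
    "Find cheap hotels in Gothenburg today",
    "Book a hostel in Stockholm with free breakfast and wifi",
]
for cell in column:
    print(find_duplicate_content(cell, column))
```

Each call compares one text against all the others, so filling the formula down a column of n texts means roughly n × n comparisons. That is why manual calculation helps for larger sheets.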
The LCS algorithm used is 98% based on a C# implementation I found on Wikibooks.