XPathOnUrl(string url; string xpath; [string attribute, string xmlHttpDownloaderConfig])

Fetches the url and returns the (vector) result from xpath expression (cached).

=XPathOnUrl(“http://google.com”;”/html/head/title”) => “Google”

The optional “attribute” allows you to retreive the value of an attribute of the specified node. Xpath “/foo/@bar” returns the inner text of node foo. To get attribute bar you specify = XPathOnUrl(url;”/foo”;”bar”).

You can use XPathOnUrl on any “XML” resource.

You can control the HTTP request (such as header and form variables) using the xmlHttpDownloaderConfig. See BuildHttpDownloaderConfig.

Note that XPath queries has to be in lower case. So “/HTML/BODY/A” wont work.

XPathOnUrl works very similar to the popular ImportXML function in Google spreadsheets, so for some inspiration of what you can achieve with XPathOnUrl :

http://www.seerinteractive.com/blog/importxml-cookbook/2011/09/17/

http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/

There’s a few differences though:

First you can’t reference attributes directly in the XPath (see above).

Secondly, ImportXML will populate the current cell and neighboring cells below with results if the XPath expression references multiple nodes. XPathOnUrl will return an Excel array.

So to achieve what you in Google spreadsheet write:

=ImportXML(“http://www.google.com/search?q=dogs”, “//h3[@class='r']/a/@href”)

(Retreives the first 10 SERP url in a search for dogs on Google.com)

In Excel+SeoTools, just writing:

=XPathOnUrl(“http://www.google.com/search?q=dogs&num=10″;”//h3[@class='r']/a”;”href”)

…in a cell will only return the first url.

The get all you perform the following steps:

  1. Select a range of [10Rows] x [1Columns] cells.
  2. Press F2 to go into formula mode.
  3. Type =XPathOnUrl(“http://www.google.com/search?q=dogs&num=10″;”//h3[@class='r']/a”;”href”)
  4. Press CTRL+SHIFT+ENTER to make it a array formula.

Working with arrays in Excel takes som practice, but quite powerful when mastered. If you don’t want to mess with arrays you can use the following trick:

=XPathOnUrl(“http://www.google.com/search?q=dogs”, “(//h3[@class='r']/a)[1]“,”href”)

This will reference the first node. To get all results in a column, you first create a column (in say column A) with values 1..10 and use the following formula:

=XPathOnUrl(“http://www.google.com/search?q=dogs”, “(//h3[@class='r']/a)["&A1&"]“,”href”)

Local files

You can also parse localfiles using XPathOnUrl using either absolute or relative filepaths:
file:///C:/path/to/file.xml
file:///path/relative/workbook.xml
file:///../../path/relative/workbook.xml

You can tell SeoTools what encoding to use, using xmlHttpDownloaderSettings.ResponseEncoding. (See BuildHttpDownloaderConfig)

This is possible in most functions that takes an url as parameter.

XPathOnUrl Cookbook

Get country where website (or ip) is hosted
Using the GeoIPService from webservicex.net and ResolveIp we can easily get the country where a website is hosted.

=XPathOnUrl(“http://www.webservicex.net/geoipservice.asmx/GetGeoIP?IPAddress=”&ResolveIp(“http://offerta.se”);”/geoip/countryname”)=>”Sweden”

For more examples, have a look in the cookbook/ folder in your SeoTools installation.

Return number of nodes

Lets say you want to retreive a the number of links in a page:

=XPathOnUrl(“http://nielsbosma.se”,”count(//a)”)

Won’t work as HtmlAgilityPack requires expression to return nodes. But as XPathOnUrl return an array with results we can use the ROWS() function in Excel to achieve this:

=ROWS(XPathOnUrl(“http://nielsbosma.se”,”//a”))

Get help with this function in the SeoTools Community Forum.

 

13 Responses to XPathOnUrl

  1. Yann says:

    Is there a way to work with arrays in multiple columns instead of multiple rows ?
    For now I’m changing the number of the node in each column but was hoping for a faster solution.

    Thanks !

  2. Ferdi says:

    Hi Yann,
    You can use the transpose formula. Just highlight the cells horizontally and as the array formula use =TRANSPOSE(ARRAY-FORMULA)

    Example in this case:
    =TRANSPOSE(XPathOnUrl(“http://www.google.de/search?q=Oktoberfest”;”//h3[@class='r']/a”))

  3. Alex says:

    I’m trying to use =XPathOnUrl(C4,”//a[contains(@href,'word-which-i-need-to-check')]“) ,where C4 – cell with URL address, to check whether the keyword is present in the actual href. Doesn’t seem to work.. Am I doing it right? or any ideas on how to fix it? Thnaks.

  4. nielsbosma says:

    Try something like:
    =XPathOnUrl(“http://offerta.se”;”//a[contains(@href,""senaste"")]“;”href”)

  5. JR says:

    In previous versions I’ve been able to use:

    //h3[@class='r']/a[@class='l']

    to qualify an anchor. This is useful when scraping results from google as it only gives you sites and not product urls.

    It doesn’t seem to work at the moment – is there a way around?

  6. JR says:

    Now sorted by using:

    =XPathOnUrl(“http://www.google.co.uk/search?num=100&q=”&B5&”&pws=0&gl=UK”, “//h3[@class='r']/a[contains(@href,""http"")]“,”href”)

    This filters the product results as they have relative rather than absolute URLs

  7. nielsbosma says:

    @JR: Great that you sorted it out. Scraping Google is tricky as the results varies from time to time.

  8. Jack says:

    hi there,
    how do I get 100 results?

  9. nielsbosma says:

    Jack: what are you trying to do?

  10. javi says:

    a tiny detail ;-)

    retrieving twitter results is very common to have http://t.co/XXXXXXXXX

    in excel cells it shows /t as [SPACE] OR [TAB]

  11. Leo says:

    Does anyone knows the XPathonURL formula to scrap Google Results after the recent update? I’ve tried different approaches with no success.

  12. Damian says:

    Hi Nils

    Sometimes I get an 503 error message, which say me that the page isn’t available, but it is.

    I get this message if I use the function xpathonurl often. Do you know why? It looks that the problem is into the function and not on my website which I’m testing.

    Thanks Damian

  13. Livia says:

    it’s like you are all bloody genia! Thanks so much Niels! This function helps me a lot. Now i need to learn why it works and then I’ll also be a bloody genius! :)
    Thanks for sharing!