Saturday, 31 October 2015

A Web Crawler using R

This R-based web crawler, available here...
1. Reads the HTML of a webpage from a given address,
2. Extracts the hyperlinks from that page to other pages and files,
3. Filters out links that fail to meet given criteria (e.g. links to other websites, pages already explored, and non-HTML files),
4. Stores the origin and destinations of the remaining links,
5. Selects a link from those discovered so far, and returns to 1.
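
The five steps above can be sketched as a small loop. This is only an illustration and not the crawler's actual code: so that it runs without a network connection, the hypothetical helper toy_links() looks links up in a made-up list, where the real crawler would download and parse a page. All names in the block (toy_site, toy_links, crawl_sketch) are assumptions introduced for the example.

```r
# Toy "website": each page name maps to the links found on that page.
# (Made-up data standing in for steps 1 and 2.)
toy_site <- list(
  home    = c("news", "sports", "http://elsewhere.example/"),
  news    = c("home", "article"),
  sports  = c("home", "news"),
  article = c("home")
)
toy_links <- function(page) toy_site[[page]]

crawl_sketch <- function(seed, n_pages) {
  explored <- character(0)
  queue    <- seed
  edges    <- data.frame(From = character(0), To = character(0),
                         stringsAsFactors = FALSE)
  while (length(explored) < n_pages && length(queue) > 0) {
    page     <- queue[1]                        # step 5: select the next page
    queue    <- queue[-1]
    explored <- c(explored, page)
    links <- toy_links(page)                    # steps 1-2: read page, extract links
    links <- links[links %in% names(toy_site)]  # step 3: drop links to other sites
    if (length(links) > 0) {                    # step 4: store origin and destination
      edges <- rbind(edges, data.frame(From = page, To = links,
                                       stringsAsFactors = FALSE))
    }
    # step 3, continued: only queue pages not already explored or queued
    queue <- c(queue, setdiff(links, c(explored, queue)))
  }
  list(explored = explored, edges = edges)
}

res <- crawl_sketch("home", n_pages = 3)
res$explored
```

In this sketch the queue is served first in, first out, so the three pages explored are "home", "news" and "sports".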

The scraper can be used to gather raw data from thousands of pages on a website, and to reveal the network of links between them. For example, starting just now at the front page of the National Post website, the crawler visited a news article, the main page for horoscopes, the day's horoscopes, and an article from the financial pages of the paper.

url_seed = "http://www.nationalpost.com/index.html"
hasall = "http://[a-zA-Z]+[.](national|financial)post[.]com/"
result = web_crawl(url_seed=url_seed, hasall=hasall, Npages=5)

112 queued, 2 ://news...elections-canada-urges...
122 queued, 3 ://news...life/horoscopes/
136 queued, 4 ://news...daily-horoscope-for-tuesday-october-27-2015
136 queued, 5 ://business.financialpost.com.../where-to-get...

The core function of the scraper is getHTMLLinks() in the XML package. As long as there is a webpage at the address given (including a custom 404 error page), this function downloads the HTML code and returns all the hyperlinks found in it as a vector of strings.
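
As a small illustration of what getHTMLLinks() returns, the sketch below parses a made-up HTML string instead of downloading a page, so it needs the XML package but no network connection. The HTML and the addresses in it are invented for the example.

```r
library(XML)

# A made-up page with three links.
html <- '<html><body>
  <a href="http://example.com/a.html">A</a>
  <a href="/b.html">B</a>
  <a href="#top">Back to top</a>
</body></html>'

doc   <- htmlParse(html, asText = TRUE)  # parse the string rather than a URL
links <- getHTMLLinks(doc)               # character vector of href values
print(links)
```

With the default settings, links that only point to an anchor within the same page (such as "#top") should be left out of the result.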

The scraper itself consists of two function definitions. The inner function, get_clean_links(), calls getHTMLLinks() and applies some basic and user-defined criteria to filter out undesired links. In the above example, the regular expression named 'hasall' is such a criterion. Any link returned by get_clean_links() must match all of the patterns in the parameter 'hasall', none of the patterns in 'hasnone', and at least one of the patterns in 'hassome'.
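
The all/none/some rule can be written in a few lines of base R. The function below, filter_links(), is a hypothetical stand-in and not the code inside get_clean_links(); it assumes each parameter is a character vector of regular expressions, and that leaving a parameter as NULL skips that test. The example links are made up.

```r
filter_links <- function(links, hasall = NULL, hasnone = NULL, hassome = NULL) {
  keep <- rep(TRUE, length(links))
  for (p in hasall)  keep <- keep & grepl(p, links)    # must match every pattern
  for (p in hasnone) keep <- keep & !grepl(p, links)   # must match no pattern
  if (length(hassome) > 0) {
    any_hit <- rep(FALSE, length(links))
    for (p in hassome) any_hit <- any_hit | grepl(p, links)
    keep <- keep & any_hit                             # must match at least one
  }
  links[keep]
}

links <- c("http://news.nationalpost.com/life/horoscopes/",
           "http://business.financialpost.com/investing/",
           "http://www.example.com/page.html",
           "http://news.nationalpost.com/logo.png")

kept <- filter_links(links,
                     hasall  = "^http://[a-zA-Z]+[.](national|financial)post[.]com/",
                     hasnone = "[.](png|jpg|pdf)$")
kept
```

Here the third link is dropped because it fails 'hasall', and the fourth because it matches 'hasnone', leaving the first two.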

On some websites, links are written relative to the site they appear on (for example, /life/horoscopes/ rather than a full address). In these cases the first part of each address is missing, and you need to put it back before the link can be used. You can do this by setting the parameter 'prepend' to the missing part of the address.
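
A minimal sketch of what prepending does, using made-up links and a made-up site root. One assumption to note: this version only adds the prefix to links that do not already start with a scheme such as http://, which may differ from how the crawler treats a page with a mix of relative and full addresses.

```r
# Made-up links as they might appear in a page's HTML.
links   <- c("/news/story.html",
             "/life/horoscopes/",
             "http://www.example.com/about.html")
prepend <- "http://www.example.com"

# Add the prefix only where the address has no scheme of its own.
is_relative <- !grepl("^[a-zA-Z][a-zA-Z0-9+.-]*://", links)
links[is_relative] <- paste0(prepend, links[is_relative])
links
```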

The outer function, web_crawl(), calls get_clean_links() and passes all relevant parameters, such as filtering criteria, to it. The function web_crawl() keeps all the acceptable links that have been discovered so far and selects from those links the next page to explore. This function also handles the formatting of the network information into two data frames, called edges and nodes, which it will return after Npages web pages have been explored.

str(result$edges)
'data.frame': 243 obs. of 4 variables:
$ From: chr "http://www.nationalpo...
$ To: chr "http://www.nationalpo...
$ From_ID: int 1 1 1 1 1 1 1 1 1 1 ...
$ To_ID: int 6 7 8 9 10 11 12 13 14 15 ...


str(result$nodes)
'data.frame': 5 obs. of  4 variables:
$ address : chr "http://www.nationalp...
$ degree_total: num  114 39 37 53 0
$ degree_new  : num  114 11 13 14 0
$ samporder  : int  1 2 3 4 5
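
To show how a list of links becomes two tables of this shape, here is a rough base R sketch with made-up addresses. It is not the code used inside web_crawl(): the ID numbering here (order of first appearance) is an assumption and differs from the output above, and degree_new is left out because this sketch does not track which links were new at the time they were found.

```r
# Made-up edge list: the page each link was found on, and where it points.
edges <- data.frame(
  From = c("a.html", "a.html", "a.html", "b.html", "b.html"),
  To   = c("b.html", "c.html", "d.html", "a.html", "e.html"),
  stringsAsFactors = FALSE
)

# Give every address an integer ID, in order of first appearance.
addresses     <- unique(c(edges$From, edges$To))
edges$From_ID <- match(edges$From, addresses)
edges$To_ID   <- match(edges$To, addresses)

# One row per explored page, in the order the pages were visited.
explored <- unique(edges$From)
nodes <- data.frame(
  address      = explored,
  degree_total = as.numeric(table(factor(edges$From, levels = explored))),
  samporder    = seq_along(explored),
  stringsAsFactors = FALSE
)

str(edges)
str(nodes)
```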

In web_crawl(), the parameter url_seed determines the first address to explore. url_seed can be a vector of web addresses, in which case the crawler begins with either the first address in the vector or a random one, depending on how queue_method is set.

The parameter queue_method determines how the next link to explore is selected. When queue_method is 'BFS' (breadth-first search), all the direct links from the page at url_seed are explored before any second-order links are explored. When queue_method is 'Random', any link that has been discovered but not yet explored can be selected next, each with equal probability.
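
The difference between the two methods comes down to which element is taken from the queue. The helper below, next_link(), is hypothetical and only illustrates the idea; it assumes the queue is a character vector kept in order of discovery.

```r
next_link <- function(queue, queue_method = c("BFS", "Random")) {
  queue_method <- match.arg(queue_method)
  index <- if (queue_method == "BFS") {
    1L                              # oldest link first (first in, first out)
  } else {
    sample.int(length(queue), 1L)   # every queued link is equally likely
  }
  list(selected = queue[index], queue = queue[-index])
}

queue <- c("page_a", "page_b", "page_c")

bfs_pick <- next_link(queue, "BFS")
bfs_pick$selected      # always "page_a"

set.seed(1)
random_pick <- next_link(queue, "Random")
random_pick$selected   # any one of the three pages
```

Because newly discovered links are added to the end of the queue, always taking the first element means every link found on the seed page is visited before any link found on a later page.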

Finally, web_crawl() throttles the speed at which you request data from web servers. This is partly for the stability of the crawler, and partly to avoid putting undue stress on, or attracting unwanted attention from, the server being scraped. The parameter 'sleeptime', which defaults to 3, sets the number of seconds that R will idle between pages.
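
In base R, this kind of pause is usually done with Sys.sleep(). The loop below is a hypothetical sketch of where the pause sits, not the crawler's own code; visit_page() is a placeholder for downloading and processing one page, and a short sleeptime is used so the example finishes quickly.

```r
# Placeholder for the real work done on each page.
visit_page <- function(address) nchar(address)

crawl_politely <- function(addresses, sleeptime = 3) {
  results <- vector("list", length(addresses))
  for (i in seq_along(addresses)) {
    results[[i]] <- visit_page(addresses[i])
    if (i < length(addresses)) Sys.sleep(sleeptime)  # idle between pages
  }
  results
}

started <- Sys.time()
out     <- crawl_politely(c("page_a", "page_b", "page_c"), sleeptime = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
elapsed   # roughly 0.4 seconds: two pauses between three pages
```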

The crawler only retains the addresses of the links, but if you want more in-depth data, you can use the getURL("address") function in the RCurl package to get the HTML code at "address". Only the link structure was necessary for my purposes, but a lot more is possible by making the crawler your own with the functions in the XML and RCurl packages.
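
As one example of going beyond the link structure, the hypothetical helper below downloads a page with getURL() and pulls out the text of its title. It is a sketch that assumes both the RCurl and XML packages are installed; get_page_title() is a name made up for this example and is not part of the crawler.

```r
# Download one page and return the text of its <title> element.
get_page_title <- function(address) {
  html <- RCurl::getURL(address, followlocation = TRUE)  # raw HTML as one string
  doc  <- XML::htmlParse(html, asText = TRUE)            # parse the downloaded text
  XML::xpathSApply(doc, "//title", XML::xmlValue)
}

# Example call (needs a network connection):
# get_page_title("http://www.nationalpost.com/index.html")
```

The same pattern works for any other part of the page: change the XPath query "//title" to select the elements you are interested in.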

The code for this crawler was written with the help of diffuseprior (http://www.r-bloggers.com/web-scraping-in-r/) and Brock Tibert (http://www.r-bloggers.com/create-a-web-crawler-in-r/).