
Sunday 29 September 2019

Tips to successfully web scrape with a macro


Recently, I updated the cricket simulator that I made in grad school, which entailed gathering four years of new T20I, IPL, and ODI cricket data. That's about 1000 matches, and the website ESPNcricinfo has been dramatically updated since I first scraped it. For that matter, so have the tools in R for scraping.

However, for one part, the play-by-play commentaries, nothing was working, and I ended up relying on recording and repeating mouse-and-keyboard macros. It's crude, but the loading-as-scrolling mechanic was just too hard to deal with programmatically, even with the otherwise very powerful RSelenium.
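
For context, the sort of scroll-and-wait loop you can try in RSelenium looks roughly like the sketch below; the browser, port, URL, and number of scrolls are placeholders rather than the exact values I used, and in my case even loops like this weren't reliably capturing the full commentary.

# Rough sketch of a scroll-until-loaded attempt with RSelenium.
# The browser, port, URL, and number of scrolls are placeholders.
library(RSelenium)

rD    <- rsDriver(browser = "firefox", port = 4567L, verbose = FALSE)
remDr <- rD$client

remDr$navigate("https://www.espncricinfo.com/...")   # one commentary page

# Scroll to the bottom repeatedly, pausing so new commentary can load
for (i in 1:40) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);",
                      args = list(""))
  Sys.sleep(2)
}

page_source <- remDr$getPageSource()[[1]]

remDr$close()
rD$server$stop()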

Using macros to scrape pages is a trial-and-error process, but with the following principles, you can drastically reduce the number of trials it takes to get it right.

For the macros, I used Asofttech Automation, shown in Figures 1 and 2.


Figure 1 - Editing a macro in Asofttech Automation



Figure 2 - Choosing a macro in Asofttech Automation

In the past I have used Automacrorecorder, but Asofttech Automation offers features like appending to a macro or deleting parts of it after the fact, rather than having to record a brand new macro for each change. This could all theoretically be done with AutoHotkey as well, but if you're not already very familiar with AutoHotkey, this process is probably easier with a less powerful, but more user-friendly, tool like Asofttech Automation.



Prep Principle A: Do whatever you can programmatically first.

Figure 3 is a list with two commentary URLs for each match, one for each innings, for an Indian Premier League (IPL) season. This list was generated through some string manipulation with stringr, but mostly through the getHTMLLinks() function from the XML package.

 
Figure 3 - Links to Indian Premier League commentaries
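
The gist of that step looks something like the sketch below. The season page URL and the 'commentary' filter are stand-ins, not the exact values used to build the list in Figure 3.

# Sketch of building the URL list; the season page URL and the
# "commentary" pattern are placeholders.
library(XML)      # for getHTMLLinks()
library(stringr)  # for string filtering and pasting

season_page <- "https://www.espncricinfo.com/..."   # hypothetical season fixtures page
all_links   <- getHTMLLinks(season_page)

# Keep only the links that look like ball-by-ball commentary pages
commentary_links <- all_links[str_detect(all_links, "commentary")]

# Assuming the scraped hrefs are relative, make them absolute and write
# one URL per line for the macro to cut from
commentary_links <- str_c("https://www.espncricinfo.com", commentary_links)
writeLines(unique(commentary_links), "ipl_commentary_urls.txt")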

Prep Principle B: Plan out your goal and the steps needed as explicitly as possible.

From each of these pages, we need to highlight all the text and save it into a text file so that we can filter it down to the play-by-play commentary later (there's a sketch of that filtering step after this list). We do this by…

1. Taking the first URL in the list, cutting it (and thereby removing it from the list),
2. Pasting it into the navigation bar of the browser and pressing enter,
3. Waiting for the page to load, then pressing 'down' or 'page down' until the end of the page is reached,
4. Selecting all (Ctrl + A, or Command + A on a Mac) and copying/pasting all of it into a waiting txt file,
5. Repeating this process hundreds of times.
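
Once the macro has dumped each page's raw text into files, the later filtering step back in R can be fairly simple. Here is a minimal sketch, assuming one text file per innings and that each ball-by-ball line begins with an over.ball number like '19.6'; that format is an assumption about the pasted text, not a guarantee.

# Sketch of filtering the pasted raw text down to commentary lines.
# Assumes one .txt file per innings in a "raw_text" folder, and that
# commentary lines start with an over.ball number such as "0.1" or "19.6".
library(stringr)

raw_files <- list.files("raw_text", pattern = "\\.txt$", full.names = TRUE)

get_commentary <- function(path) {
  all_lines <- readLines(path, warn = FALSE)
  all_lines[str_detect(all_lines, "^\\d{1,2}\\.\\d")]
}

commentary_by_innings <- lapply(raw_files, get_commentary)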


The main principle is to minimize variation. That means doing whatever you can before recording your macro to make sure that each playback does exactly the same thing you did when you recorded it.

Principle 1) Close every non-essential window, especially ones that might create pop-up messages. If the webpages you're scraping have pop-ups or elements that might inconsistently disrupt scraping, use a script blocker.

 Principle 2) Write down the windows that are open, and what arrangement they're in. 

As a human, you can see when, say, the 'notepad' window is on the left half of the screen instead of the right, but a macro is just blindly repeating previously recorded clicks and key presses.

In my macro, four windows are open: a Notepad file with the URLs, a Notepad file to paste raw text into, a web browser, and Asofttech Automation. They appear in my taskbar in exactly that order.

Principle 3) Use the keyboard whenever possible. (Instead of clicking and dragging to highlight text, use Ctrl + A.)

Principle 4) When you must use the mouse, only interact with things that will always be in the same place and that you can control.

For example: You can paste something to the navigation bar, but avoid clicking on links if possible.

Why? Because webpage layouts change often, and that link may be somewhere else later; even a few pixels of difference can result in a link not being clicked.

Principle 5) Don't give your macro opportunities to 'wander'.

Specifically, don't use navigation buttons like 'back'.

If something unusual happens, like a page failing to load or the expected page not being there (say, due to a cancelled or abandoned match), then the back button might not bring you to your previous page, but to the page BEFORE that.

Likewise, don't open the start menu during your macro, because later you might end up opening an unexpected program.

Changing the active window with Alt + Tab is also dangerous, because that shortcut depends on the order in which windows were last active, which is difficult to keep track of and needs to be the same as it was in the 'home state' (see Principle 7).

Principle 6) When interacting with things you can't control, leave wide margins for error.
That means waiting extra time for web pages to load. In the case of scrolling webpages, it means pressing 'down' or 'page down' about twice as much as necessary for a typical web page.

Principle 7) Return to the 'home state' at the end of the macro. That means having the same window active at the end of the recorded macro as you did at the start. If possible, put the keyboard cursor in the same place as when you started.

In my macro, each time I use a URL, I remove it from the list, so that the top line of the URL list is a new URL on each loop. This allows me to set my home state to the very top-left of the URL notepad file without having to count keypresses. Instead, I just click anywhere in that notepad and press 'Ctrl + Home' to get to the beginning.


Finally, have patience. I've been using macros like this for years, and this one still took me 7 tries to get right; one of those unsuccessful tries ran for 2 hours before a fault appeared. Like a lot of automation, it's tedious work until it suddenly isn't.
