Work in progress
The goal of crrri is to provide a native Chrome Remote Interface in R using the Chrome Debugging Protocol. This is a low-level implementation of the protocol heavily inspired by the chrome-remote-interface JavaScript library written by Andrea Cardaci.
This package is intended for R package developers who need to orchestrate Chromium/Chrome: with crrri, you can easily interact with (headless) Chromium/Chrome using R. We worked hard to provide the simplest possible API. However, you will have to do the bulk of the work and learn how the Chrome DevTools Protocol works. Interacting with Chromium/Chrome using the DevTools Protocol is a highly technical and error-prone task: you will be close to the metal and have full power (be cautious!).
This package is built on top of the websocket and promises packages. The default design of the crrri functions is asynchronous: they return promises. You can also use crrri with callbacks if you prefer.
We are highly indebted to Miles McBain for his seminal work on chradle that inspired us. Many thanks!
First of all, you do not need a node.js configuration because crrri is fully written in R.
You only need a recent version of Chromium or Chrome. A standalone version works perfectly well on Windows. By default, crrri will try to find a Chrome binary on your system using the find_chrome_binary() function. You can tell crrri to use a specific version by setting the HEADLESS_CHROME environment variable to the path of Chromium or Chrome (this is the same environment variable that is used in decapitated). You can check that it is set correctly by executing Sys.getenv("HEADLESS_CHROME") in your R console.
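As a minimal sketch of the configuration described above, the variable can be set for the current R session with base R. The path used here is a hypothetical placeholder (an assumption, not a default of crrri); replace it with the location of your own binary.

```r
# Hypothetical path: replace it with the real location of Chromium/Chrome.
chrome_path <- "/usr/bin/chromium"

# Set the variable for the current R session only.
Sys.setenv(HEADLESS_CHROME = chrome_path)

# Check the value, and whether the path exists on this machine.
Sys.getenv("HEADLESS_CHROME")
file.exists(Sys.getenv("HEADLESS_CHROME"))
```

To keep the setting across sessions, the same variable can be declared in your .Renviron file.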
Otherwise, you can also use the bin argument of the Chrome class new() method to provide the path directly.
chrome <- Chrome$new(bin = "<path-to-chrome-binary>")
Note that if you don’t know where your binary is, you can call the find_chrome_binary() function directly; it will try to guess where your binary is (you might need to install the package).
These two calls are equivalent:
chrome <- Chrome$new(bin = find_chrome_binary())
# the default
chrome <- Chrome$new(bin = NULL)
You can install the development version of crrri from GitHub with:
remotes::install_github('rlesur/crrri')
Using crrri interactively
The crrri package is a low-level interface and is not intended to be used interactively: the goal of crrri is to provide R developers with a set of classes and helper functions to build higher-level functions.
However, you can discover headless Chrome automation interactively in your R session using crrri. This will help you learn the Chrome DevTools Protocol and the crrri design, and develop higher-level functions.
Assuming that you have configured the HEADLESS_CHROME environment variable (see above), you can start headless Chrome:
library(crrri)
chrome <- Chrome$new()
The Chrome class constructor is a synchronous function. That means the R session is on hold until the command terminates.
The $connect() method of the Chrome class will connect the R session to headless Chrome. As the connection process can take some time, the R session is not put on hold: this is an asynchronous function. This function returns a promise which is fulfilled when R is connected to Chrome.
However, you can pass a callback function to the $connect() method using its callback argument. In this case, the returned object will be a connection object:
client <- chrome$connect(callback = function(client) {
  client$inspect()
})
The $inspect() method of the connection object opens the Chrome DevTools Inspector in RStudio (>= 1.2.1335) or in your default web browser (you may run into trouble if the inspector is not opened in Chromium/Chrome). It is convenient if you need to inspect the content of a web page because everything you need is in RStudio.
In order to discover the Chrome DevTools Protocol commands and event listeners, it is recommended to extract one of the domains from the connection object:
Page <- client$Page
The Page object represents the Page domain. It possesses methods to send commands or listen to specific events.
For instance, you can send to Chromium/Chrome the Page.navigate command as follows:
Page$navigate(url = "http://r-project.org")
Once the page is loaded by headless Chrome, RStudio looks like this:
In the R console, you will see a promise object. It is fulfilled when Chromium/Chrome sends back to R a message confirming that the command was received. This is because the Page$navigate() function is also asynchronous. All the asynchronous methods have a callback argument. When the R session receives the result of the command from Chrome, R executes this callback function, passing it the result object. For instance, you can execute:
Page$navigate(url = "https://ropensci.org/", callback = function(result) {
  cat("The R session has received this result from Chrome!\n")
  print(result)
})
Once the page is loaded, you will see both the web page and the result object in RStudio.
To inspect the result of a command, you can pass the print function to the callback argument:
Page$navigate(url = "https://ropensci.org/", callback = print)
#> $frameId
#> [1] "3BB38B10082F28A946332100964486EC"
#>
#> $loaderId
#> [1] "9DCF07625678433563CB03FFF1E8A6AB"
The result object sent back from Chrome is also the value of the promise once it is fulfilled. Recall that if you do not use a callback function, you get a promise:
async_result <- Page$navigate(url = "http://r-project.org")
You can print the value of this promise once fulfilled with:
async_result %...>% print()
#> $frameId
#> [1] "3BB38B10082F28A946332100964486EC"
#>
#> $loaderId
#> [1] "7B2383E8F2F39273E18E4D918F1852A0"
As you can see, this leads to the same result as with a callback function.
To sum up, these two forms perform the same actions:
Page$navigate(url = "http://r-project.org", callback = print)
Page$navigate(url = "http://r-project.org") %...>% print()
If you interact with headless Chrome in the R console using crrri, these two forms are equivalent.
However, if you want to use crrri to develop higher-level functions, the most reliable way is to use promises.
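Why promises suit higher-level functions can be sketched without Chrome: a function that returns a promise can itself be chained by its caller. In the example below, fake_navigate() is a hypothetical stand-in for an asynchronous command (it is not part of crrri) and returns an already fulfilled promise from the promises package. The final loop runs the later event loop, which is an assumption about running the code as a non-interactive script.

```r
library(promises)

# Hypothetical stand-in for an asynchronous command such as Page$navigate():
# it returns a promise fulfilled with a fake result object.
fake_navigate <- function(url) {
  promise_resolve(list(frameId = "fake-frame", url = url))
}

# A higher-level function built by chaining. It returns a promise too,
# so callers can keep chaining on its result.
navigate_and_get_frame <- function(url) {
  fake_navigate(url) %...>% (function(result) result$frameId)
}

frame_id <- NULL
navigate_and_get_frame("http://r-project.org") %...>% (function(id) {
  frame_id <<- id
})

# Run the pending callbacks (needed in a non-interactive script).
while (!later::loop_empty()) later::run_now()
frame_id
```

With a callback-only design, navigate_and_get_frame() would have to accept and forward its own callback argument; returning the promise keeps the composition in the caller's hands.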
Do not forget to close headless Chrome with:
chrome$close()
Since the RStudio viewer has lost the connection, you will see this screen in RStudio:
Now, you can take some time to discover all the commands and events of the Chrome DevTools Protocol. The following examples will introduce some of them.
While working interactively, you can obtain the list of available domains in your version of Chromium/Chrome.
First, launch Chromium/Chrome and connect the R session to headless Chromium/Chrome:
chrome <- Chrome$new()
client <- chrome$connect(~ .x$inspect())
Once connected, you just have to print the connection object to get information about the connection and the available domains:
client
#> <CDP CONNECTION>
#> connected to: http://localhost:9222/
#> target type: "page"
#> target ID: "9A576420CADEA9A514C5F027D30B410D"
#> <DOMAINS>
#>
#> Accessibility (experimental)
#>
#> Animation (experimental)
#>
#> ApplicationCache (experimental)
#>
#> Audits (experimental): Audits domain allows investigation of page violations and possible improvements.
#>
#> Browser: The Browser domain defines methods and events for browser managing.
#>
#> CacheStorage (experimental)
#>
#> Cast (experimental): A domain for interacting with Cast, Presentation API, and Remote Playback API functionalities.
...
This information is retrieved directly from Chromium/Chrome: you may obtain different information depending on the Chromium/Chrome version.
In the most recent versions of the Chrome DevTools Protocol, more than 40 domains are available. A domain is a set of commands and event listeners.
In order to work with a domain, it is recommended to extract it from the connection object. For instance, if you want to access the Runtime domain, execute:
Runtime <- client$Runtime
If you print this object, this will open the online documentation about this domain in your browser:
Runtime # opens the online documentation in a browser
Here is an example that produces a PDF of the R Project website:
library(promises)
library(crrri)
library(jsonlite)

perform_with_chrome(function(client) {
  Page <- client$Page

  Page$enable() %...>% { # await enablement of the Page domain
    Page$navigate(url = "https://www.r-project.org/")
    Page$loadEventFired() # await the load event
  } %...>% {
    Page$printToPDF()
  } %...>% { # await PDF reception
    .$data %>% base64_dec() %>% writeBin("r_project.pdf")
  }
})
All the functions of the crrri package (commands and event listeners) return promises (as defined in the promises package) by default. When building higher-level functions, do not forget that you have to deal with promises (they will keep you from falling into callback hell).
For instance, you can write a save_url_as_pdf function as follows:
save_url_as_pdf <- function(url) {
  function(client) {
    Page <- client$Page

    Page$enable() %...>% {
      Page$navigate(url = url)
      Page$loadEventFired()
    } %...>% {
      Page$printToPDF()
    } %...>% {
      .$data %>%
        jsonlite::base64_dec() %>%
        writeBin(paste0(httr::parse_url(url)$hostname, ".pdf"))
    }
  }
}
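If one step of such a chain fails, the steps after it are skipped and the error travels down the chain, where it can be handled once. The sketch below illustrates this with the promises package alone, so it runs without Chrome. fake_command() is a hypothetical stand-in for a crrri command and is not part of the package.

```r
library(promises)

# Hypothetical stand-in for a command: the promise is rejected when `ok` is FALSE.
fake_command <- function(ok) {
  promise(function(resolve, reject) {
    if (ok) resolve("done") else reject(simpleError("command failed"))
  })
}

outcome <- NULL
fake_command(ok = TRUE) %...>% (function(value) {
  # Returning a promise makes the chain wait for it.
  fake_command(ok = FALSE)
}) %...>% (function(value) {
  # Skipped, because the previous step was rejected.
  outcome <<- "unexpected success"
}) %...!% (function(error) {
  # Single place where the failure is handled.
  outcome <<- paste("handled:", conditionMessage(error))
})

# Run the pending callbacks (needed in a non-interactive script).
while (!later::loop_empty()) later::run_now()
outcome
```

The same %...!% step could be appended to a chain of real commands; how Chrome-side errors are reported by crrri is not covered by this sketch.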
You can pass several functions to perform_with_chrome():
save_as_pdf <- function(...) {
  list(...) %>%
    purrr::map(save_url_as_pdf) %>%
    perform_with_chrome(.list = .)
}
You have created a save_as_pdf() function that can handle multiple URLs:
save_as_pdf("http://r-project.org", "https://ropensci.org/", "https://rstudio.com")
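Note that save_url_as_pdf() names each file after the hostname of the URL, so two URLs on the same host would be written to the same file. One possible workaround, sketched here with base R only, is to compute distinct file names beforehand. pdf_names() is a hypothetical helper, not part of crrri, and its regular expression is a simplification that assumes URLs of the form scheme://host/path.

```r
# Hypothetical helper: derive one distinct PDF file name per URL.
pdf_names <- function(urls) {
  # Keep what sits between "scheme://" and the next "/", ":", "?" or "#".
  hosts <- sub("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$", "\\1", urls)
  # make.unique() appends a numeric suffix to repeated names.
  paste0(make.unique(hosts, sep = "_"), ".pdf")
}

pdf_names(c(
  "http://r-project.org",
  "https://ropensci.org/",
  "https://ropensci.org/blog"
))
```

The second ropensci.org URL gets a suffixed name instead of overwriting the first file.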
chrome-remote-interface JS scripts: dump the DOM
With crrri, you should be able to transpose some JS scripts written with the chrome-remote-interface node.js module with minimal effort.
For instance, take this JS script that dumps the DOM:
const CDP = require('chrome-remote-interface');

CDP(async(client) => {
    const {Network, Page, Runtime} = client;
    try {
        await Network.enable();
        await Page.enable();
        await Network.setCacheDisabled({cacheDisabled: true});
        await Page.navigate({url: 'https://github.com'});
        await Page.loadEventFired();
        const result = await Runtime.evaluate({
            expression: 'document.documentElement.outerHTML'
        });
        const html = result.result.value;
        console.log(html);
    } catch (err) {
        console.error(err);
    } finally {
        client.close();
    }
}).on('error', (err) => {
    console.error(err);
});
Using crrri, you can write:
library(promises)
library(crrri)

async_dump_DOM <- function(client) {
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime
  Network$enable() %...>% {
    Page$enable()
  } %...>% {
    Network$setCacheDisabled(cacheDisabled = TRUE)
  } %...>% {
    Page$navigate(url = 'https://github.com')
  } %...>% {
    Page$loadEventFired()
  } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %...>% (function(result) {
    html <- result$result$value
    cat(html, "\n")
  })
}

perform_with_chrome(async_dump_DOM)
If you want to write a higher-level function that dumps the DOM, you can embed the main part of this script in a function:
dump_DOM <- function(url) {
  perform_with_chrome(function(client) {
    Network <- client$Network
    Page <- client$Page
    Runtime <- client$Runtime
    Network$enable() %...>% {
      Page$enable()
    } %...>% {
      Network$setCacheDisabled(cacheDisabled = TRUE)
    } %...>% {
      Page$navigate(url = url)
    } %...>% {
      Page$loadEventFired()
    } %...>% {
      Runtime$evaluate(
        expression = 'document.documentElement.outerHTML'
      )
    } %...>% (function(result) {
      html <- result$result$value
      cat(html, "\n")
    })
  })
}
Now, you can use it to dump David Gohel’s blog:
dump_DOM(url = "http://www.ardata.fr/blog/")
You can find many other examples in the wiki of the chrome-remote-interface module.
In crrri, there are two types of messages:
crrri uses debugme for printing those messages. It is disabled by default, so you won’t see any messages.
You need to add "crrri" to the DEBUGME environment variable before loading the package to activate the messaging feature. Currently, there is only one level of message in crrri. Also, debugme is a Suggested dependency and you may need to install it manually if it is not already installed.
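A minimal sketch of this setup with base R follows. It assumes that DEBUGME holds a comma-separated list of package names, and it keeps any package already listed there.

```r
# Add "crrri" to DEBUGME without dropping packages that are already listed.
# Assumption: DEBUGME is a comma-separated list of package names.
current <- strsplit(Sys.getenv("DEBUGME"), ",", fixed = TRUE)[[1]]
Sys.setenv(DEBUGME = paste(unique(c(current, "crrri")), collapse = ","))

Sys.getenv("DEBUGME")

# Load the package only after the variable is set:
# library(crrri)
```

The library() call is left commented out so that the snippet runs even where crrri is not installed.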