The crrri package provides a Chrome Remote Interface for R. It is inspired by the node.js module chrome-remote-interface.

This vignette walks through several usage examples for crrri.

All the examples come from the chrome-remote-interface or puppeteer documentation and are reproduced here using crrri.

Setup

It is best to set the HEADLESS_CHROME environment variable beforehand to the path of a Chromium/Chrome binary on your system, so that crrri uses it. If you do not, you can provide the path to a Chromium/Chrome binary in Chrome$new() or let the package guess it using find_chrome_binary().

The default behavior of crrri is equivalent to setting the environment variable like this:

Sys.setenv(HEADLESS_CHROME = crrri::find_chrome_binary())
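The lookup order described above can be sketched in base R. This is only an illustration, not crrri's code: resolve_chrome_path() is a hypothetical helper, the paths are made up, and it assumes that an explicit path takes precedence over the environment variable.

```r
# Hypothetical helper, not part of crrri: decide which binary path to use.
resolve_chrome_path <- function(explicit = NULL,
                                guess = function() NA_character_) {
  # 1. a path given explicitly wins (assumed precedence)
  if (!is.null(explicit) && nzchar(explicit)) {
    return(explicit)
  }
  # 2. otherwise use HEADLESS_CHROME when it is set and non-empty
  from_env <- Sys.getenv("HEADLESS_CHROME", unset = "")
  if (nzchar(from_env)) {
    return(from_env)
  }
  # 3. last resort: a guessing function, e.g. crrri::find_chrome_binary
  guess()
}

Sys.setenv(HEADLESS_CHROME = "/usr/bin/chromium")  # illustrative path
path_from_env <- resolve_chrome_path()
path_explicit <- resolve_chrome_path(explicit = "/opt/chrome/chrome")

Sys.unsetenv("HEADLESS_CHROME")
path_guessed <- resolve_chrome_path(guess = function() "/guessed/chrome")
```

In real use, the guess argument could be crrri::find_chrome_binary, which the text above names as the package's own fallback.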

We need to load crrri along with promises, which provides the tools for working with the promises that crrri is built on.

library(crrri)
library(promises)

Example 1: Take a screenshot

This first example is inspired by this post that uses the chrome-remote-interface node.js package.

The first step is to launch Chromium/Chrome in headless mode:

chrome <- Chrome$new()

Then connect R to headless Chromium/Chrome with the connect() method. Since the connection process is not immediate, the connect() method returns a promise that is fulfilled when R is connected to Chrome. The value of this promise is the connection object.

client <- chrome$connect()
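To see how a promise hands its value to the next step without needing Chrome, here is a minimal sketch that uses only the promises and later packages. A plain number stands in for the connection object; fake_client, doubled and result are illustrative names, not part of crrri.

```r
library(promises)

# Stand-in for chrome$connect(): a promise that is already fulfilled.
fake_client <- promise_resolve(21)

# The promise pipe passes the fulfilled value to the next step as `.`
doubled <- fake_client %...>% { . * 2 }

# then() is the function form of the same idea; here it stores the value.
result <- NULL
then(doubled, onFulfilled = function(value) result <<- value)

# In a script, callbacks only run when the later event loop is serviced;
# an interactive session does this for you while the console is idle.
while (!later::loop_empty()) later::run_now()

result
```

The same mechanism is what delivers the connection object to a function placed after `client %...>%`.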

You need to write a function whose first parameter will receive the client connection object.

screenshot_file <- tempfile(fileext = ".png")

screenshot <- function(client) {
  # some constants
  targetUrl <- "https://cran.rstudio.com"
  viewport <- c(1440, 900)
  screenshotDelay <- 2 # seconds

  # extract the domains you need
  Page <- client$Page
  Emulation <- client$Emulation

  # enable events for the Page domain
  Page$enable() %...>% {
    # modify the viewport settings
    Emulation$setDeviceMetricsOverride(
      width = viewport[1],
      height = viewport[2],
      deviceScaleFactor = 0,
      mobile = FALSE,
      dontSetVisibleSize = FALSE
    )
  } %...>% {
    # go to url
    Page$navigate(targetUrl)
    # wait until the page is loaded
    Page$loadEventFired()
  } %>%
    # add a delay 
    wait(delay = screenshotDelay) %...>% {
    # capture screenshot
    Page$captureScreenshot(format = "png", fromSurface = TRUE)
  } %...>% {
    .$data %>%
      jsonlite::base64_dec() %>%
      writeBin(screenshot_file)
  } %>%
  # disconnect the client from headless Chrome, whether or not the chain succeeded
  finally(
    ~ client$disconnect()
  ) %...!% {
    cat("Error:", .$message, "\n")
  }
}
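The delay step in screenshot() postpones the rest of the chain by a fixed time. The sketch below shows one way such a delay can be built from promises and later. It is not crrri's implementation of wait(): delay_value() is a hypothetical helper written for this illustration.

```r
library(promises)

# Hypothetical helper: a promise fulfilled with `value` after `delay` seconds.
delay_value <- function(value, delay) {
  promise(function(resolve, reject) {
    later::later(function() resolve(value), delay = delay)
  })
}

delivered <- NULL
delayed <- delay_value("page loaded", delay = 0.2)
then(delayed, onFulfilled = function(value) delivered <<- value)

# Service the event loop until the timer and the callback have run.
while (!later::loop_empty()) later::run_now(timeoutSecs = 0.1)

delivered
```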

You can then take a screenshot by passing the connection object to this screenshot() function:

client %...>% screenshot()
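The last step of screenshot() decodes the base64 string found in the data field of the response and writes the bytes to a file. That step can be tried in isolation with jsonlite; in this sketch the response list is a stand-in built by hand, holding only the 8-byte PNG file signature rather than a real image.

```r
# Stand-in for the value returned by Page$captureScreenshot():
# the base64 encoding of the 8-byte PNG file signature.
png_signature <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))
response <- list(data = jsonlite::base64_enc(png_signature))

# Same decode-and-write step as in screenshot()
out_file <- tempfile(fileext = ".png")
writeBin(jsonlite::base64_dec(response$data), out_file)

# Read the file back to compare it with the original bytes.
bytes_on_disk <- readBin(out_file, what = "raw", n = file.size(out_file))
round_trip_ok <- identical(bytes_on_disk, png_signature)
```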

The screenshot is written to disk at the path stored in screenshot_file.

Example 2: Dump HTML after page loaded

This example is inspired by this JavaScript script from the chrome-remote-interface wiki that dumps the DOM.

html_file <- tempfile(fileext = ".html")

client <- chrome$connect()

dump_DOM <- function(client) {
  Network <- client$Network
  Page <- client$Page
  Runtime <- client$Runtime
  Network$enable() %...>%
  { Page$enable() } %...>%
  { Network$setCacheDisabled(cacheDisabled = TRUE) } %...>%
  { Page$navigate(url = "https://github.com") } %...>%
  { Page$loadEventFired() } %...>% {
    Runtime$evaluate(
      expression = 'document.documentElement.outerHTML'
    )
  } %...>% {
    writeLines(c(.$result$value, "\n"), con = html_file)
  } %>%
  finally(
    ~ client$disconnect()
  ) %...!% {
    cat("Error:", .$message, "\n")
  }
}
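Both functions end with the same tail: a finally() step for cleanup followed by an error handler. The sketch below reproduces that pattern with a deliberately failing promise, so it runs without Chrome. The variable names and the error message are illustrative, and the cleanup flag stands in for client$disconnect().

```r
library(promises)

cleaned_up <- FALSE
error_message <- NULL

# A chain whose middle step fails, simulating e.g. a navigation error.
failing_task <- then(promise_resolve("start"), onFulfilled = function(value) {
  stop("navigation failed")
})

# finally() runs its callback on success and on failure, then passes the
# rejection along; catch() is the function form of the %...!% operator.
handled <- catch(
  finally(failing_task, onFinally = function() cleaned_up <<- TRUE),
  onRejected = function(err) error_message <<- conditionMessage(err)
)

while (!later::loop_empty()) later::run_now()
```

Placing the cleanup before the error handler means it runs even when an earlier step was rejected.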

Execute the task:

client %...>% dump_DOM()

Here are the first 20 lines of what we get in html_file:

<html lang="en"><head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://assets-cdn.github.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">



  <link crossorigin="anonymous" media="all" integrity="sha512-lLo2nlsdl+bHLu6PGvC2j3wfP45RnK4wKQLiPnCDcuXfU38AiD+JCdMywnF3WbJC1jaxe3lAI6AM4uJuMFBLEw==" rel="stylesheet" href="https://assets-cdn.github.com/assets/frameworks-08fc49d3bd2694c870ea23d0906f3610.css">
  <link crossorigin="anonymous" media="all" integrity="sha512-4kfWSrzu4OShEnC5m0lqUCfKkZfG7JH0ff4wnEtubTUTZqV5pS5oUMTOvWE2DDL7ttjZ9FpnZInl/0TLO3EIiA==" rel="stylesheet" href="https://assets-cdn.github.com/assets/github-6c1d4c04bb55a87b9cb81ffdbd683662.css">


  <link crossorigin="anonymous" media="all" integrity="sha512-PcJMPDRp7jbbEAmTk9kaL2kRQqg69QZ26WsZf07xsPyaipKsi3wVG0805PZNYXxotPDAliKKFvNSQPhD8fp1FQ==" rel="stylesheet" href="https://assets-cdn.github.com/assets/site-50c740d9290419d070dd6213a7cd03b5.css">

This can be useful for parsing HTML with rvest once a page has finished loading.
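As a rough illustration of that follow-up step, the sketch below parses a small hand-written page with rvest. demo_file and its content are stand-ins for html_file, and html_elements() assumes rvest 1.0.0 or later (older versions call it html_nodes()).

```r
library(rvest)

# Stand-in for html_file: a small page written to a temporary file.
demo_file <- tempfile(fileext = ".html")
writeLines(c(
  "<html><head><title>Demo page</title></head><body>",
  "<a href='https://cran.r-project.org'>CRAN</a>",
  "<a href='https://github.com'>GitHub</a>",
  "</body></html>"
), con = demo_file)

page <- read_html(demo_file)

# Select every link, then pull out its target and its visible text.
links <- html_elements(page, "a")
link_urls <- html_attr(links, "href")
link_text <- html_text(links)
page_title <- html_text(html_element(page, "title"))
```

With a real dump, read_html(html_file) would take the place of the stand-in file.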

How to use crrri?

Some introductory examples

Christophe Dervieux

2020-05-28
