library(rvest)
library(tidyverse)
library(knitr)
library(plyr)
library(dplyr)
library(jsonlite)
library(lubridate)
library(RSelenium)
We look at two methods to capture departure and arrival data for an airport on the FlightRadar website: using a browser running on a server, and using an XHR request.
For each airport page, the FlightRadar website offers general information as well as departure and arrival flight information. For this tutorial we try to scrape the arrivals page of Bordeaux Mérignac Airport (BOD).
As you can see if you go to the departures page, there are two interesting buttons, one at the top of the page and one at the bottom.
To display all the available data (roughly 24h of past and future departures/arrivals), we simulate multiple clicks on these two buttons, and we stop only when the buttons disappear from the page.
At the end of this tutorial, we present how to use Docker container technology to export our web scraping script to a remote server for autonomous execution, 24 hours a day.
Because webmasters set up defenses to protect their data (JavaScript tests, user-agent checks, infinite loading, etc.), you need to simulate human behavior, if possible using a real browser.
In short, Selenium is a multi-tool project focused on task automation for testing web applications. It works with many Internet browsers and many operating systems.
Selenium WebDriver gives developers an API to drive a headless Internet browser without opening it. As a developer, you can use this API from your favorite language (Java, Python, R, etc.) to send commands to the browser: navigate, move the mouse, click on DOM elements, send keyboard input to forms, inject JavaScript, capture an image of the page, extract HTML, etc.
First, you need to install and load the RSelenium package, the R bindings for the Selenium WebDriver API:
install.packages("devtools")
devtools::install_github("ropensci/RSelenium")
Depending on your existing configuration and OS, you will probably need to install some additional software packages.
It is possible to use Selenium directly with your own browser, but we prefer to use a server version. Why? Because with the server version of Selenium, you have the possibility:
Selenium is a fast-moving project and some releases are really buggy, so try to choose a stable version, and don’t despair.
!! Before continuing, read the documentation on Docker at the bottom of this document. It explains what Docker, images and containers really are, and how to install images/containers on your system !!
When that’s done, we pull and run one of the Docker Selenium server images from a terminal. For this tutorial we use Firefox!
In a classic context (good Internet connection), we pull images directly from the Docker Hub server, a central repository like CRAN for R.
sudo docker pull selenium/standalone-firefox:3.14.0-arsenic
But because the images are large (1 GB for the two images used in this tutorial), we prefer to load them directly from the USB key provided by your teachers. Open a terminal in the folder where the images are located.
sudo docker load --input=r-alpine.tar
sudo docker load --input=rSelenium.tar
Create the Selenium container, which contains Firefox:
sudo docker run --shm-size=2g --name selenium -d -p 4445:4444 selenium/standalone-firefox:3.14.0-arsenic
Type sudo docker ps to check that the server is running correctly and listening on port 4445.
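The same check can be sketched from R before trying to open a browser session. This is a minimal base R sketch, not part of RSelenium: port_is_open is a hypothetical helper name, and it only tells you that something accepts TCP connections on the port, not that Selenium itself is healthy.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened.
port_is_open <- function(host = "localhost", port = 4445L, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL  # connection refused or timed out
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# 4445 is the host port published by the docker run command above
if (!port_is_open("localhost", 4445L)) {
  message("Nothing is listening on port 4445, check 'sudo docker ps'")
}
```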
Connect and open the browser on the server.
user_agent_list <- c(
  "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
  "Mozilla/5.0 (Windows NT 6.1; rv:27.3) Gecko/20130101 Firefox/27.3",
  "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0",
  "Mozilla/5.0 (Windows NT 6.2; Win64; x64;) Gecko/20100101 Firefox/20.0",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
  "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36",
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.90 Safari/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.04 Chromium/17.0.963.56 Chrome/17.0.963.56 Safari/535.11"
)
# Pick one user agent at random for this browser session
fprof <- makeFirefoxProfile(list(general.useragent.override = sample(user_agent_list, 1)))
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, extraCapabilities = fprof)
remDr$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
##
## $browserName
## [1] "firefox"
##
## $browserVersion
## [1] "61.0.1"
##
## $`moz:accessibilityChecks`
## [1] FALSE
##
## $`moz:headless`
## [1] FALSE
##
## $`moz:processID`
## [1] 73
##
## $`moz:profile`
## [1] "/tmp/rust_mozprofile.hYVGLKxPdfkd"
##
## $`moz:useNonSpecCompliantPointerOrigin`
## [1] FALSE
##
## $`moz:webdriverClick`
## [1] TRUE
##
## $pageLoadStrategy
## [1] "normal"
##
## $platformName
## [1] "linux"
##
## $platformVersion
## [1] "4.15.0-34-generic"
##
## $rotatable
## [1] FALSE
##
## $timeouts
## $timeouts$implicit
## [1] 0
##
## $timeouts$pageLoad
## [1] 300000
##
## $timeouts$script
## [1] 30000
##
##
## $webdriver.remote.sessionid
## [1] "413c56f3-b518-4c34-8aeb-37f2fb3cd9c9"
##
## $id
## [1] "413c56f3-b518-4c34-8aeb-37f2fb3cd9c9"
remDr$maxWindowSize()
remDr$executeScript("return navigator.userAgent;", list(""))
## [[1]]
## [1] "Mozilla/5.0 (X11; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"
John Harrison, the creator and first committer of the RSelenium binding library for Selenium, wrote a big tutorial covering lots of commands: https://rpubs.com/johndharrison/RSelenium-Basics
Some of them:
- remDr$maxWindowSize() : maximize the browser window.
- remDr$navigate("https://www.google.fr") : navigate to a URL.
- remDr$screenshot(display = TRUE) : take a screenshot of the web page and display it in the RStudio Viewer.
- remDr$findElement(...) : find an element in the HTML structure, using different methods: xpath, css, etc.
- remDr$executeScript(...) : execute a JavaScript script in the remote browser.
- element$clickElement() : click on an element previously returned by findElement().
Open the Web Developer tools in your favorite browser on the arrivals web page of BOD: https://www.flightradar24.com/data/airports/bod/arrivals
We investigate what happens in the HTML code when we click the load earlier or load later button. Why do we do that? To understand how we can automate things later.
Because we want to automate clicks on these two buttons, we need to understand WHEN to stop clicking :) If we click an infinite number of times, an error will probably be triggered when one of the two buttons disappears.
Select the Selector tool and click on the load earlier button.
If you click on the right thing, you should see the part of the HTML code that interests us highlighted:
Now, if you highlight and click with the selector tool on the load later flights button, you get something like this:
There is not much difference between these two button objects. It seems that only the timestamp, the data page number and the button text change …
Highlight and click the load earlier flights button one more time. Click again to load a new page of data. You see that the HTML code changes during the data load to disable clicks on the button. Not so interesting. Now repeat the click and stop only when the button disappears from your screen.
Great, a new CSS style attribute appears to indicate that this button object is now hidden: style="display: none;"
How can we reuse this important information during data harvesting to detect whether the button is enabled or disabled? The best solution is to use an XPath query!
Load the page in the selenium server
remDr$navigate("https://www.flightradar24.com/data/airports/bod/arrivals")
Sys.sleep(5) # time to load !
remDr$screenshot(file = "screenshoot.png")
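A fixed Sys.sleep(5) either wastes time or is too short on a slow connection. One possible alternative, sketched below in base R, is to poll a condition until it becomes true or a timeout expires. wait_until is a hypothetical helper, not an RSelenium function, and the XPath in the commented usage is only an example.

```r
# Hypothetical helper: call condition() repeatedly until it returns TRUE
# or until `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) {
      return(TRUE)
    }
    if (Sys.time() >= deadline) {
      return(FALSE)
    }
    Sys.sleep(interval)
  }
}

# Usage with a live session (commented out because it needs the Selenium server):
# page_ready <- wait_until(function() {
#   length(remDr$findElements(using = "xpath", value = "//button")) > 0
# }, timeout = 15)
```

The condition is passed as a function so that it is re-evaluated on every poll instead of once at call time.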
Building a correct XPath expression can be difficult. A good way to test the validity of your XPath expressions is to do it interactively, using the web developer console. There are some good cheat sheet pages which summarize all the possibilities of XPath: 1 2
Click on the Console tab:
Type this in the console: $x("//button[@class='btn btn-table-action btn-flights-load']")
The result is an interactive array that you can expand as a tree if you want.
Click, click, click until one of the loading buttons disappears; now we try to select only the button that is still available. XPath understands boolean operators (or, and, not, etc.), so we filter on @class and @style:
$x("//button[@class='btn btn-table-action btn-flights-load' and not(contains(@style,'display: none;'))]")
Great, this query returns only the valid (still visible) button. We use this query later to stop our infernal button-clicking loop.
Now we try to build this query using RSelenium with the findElements() function:
loadmorebutton <- remDr$findElements(using = 'xpath', "//button[@class='btn btn-table-action btn-flights-load' and not(contains(@style,'display: none;'))]")
Display the text of each element retrieved by findElements(), using the getElementText() function:
unlist(lapply(loadmorebutton, function(x){x$getElementText()}))
## [1] "Load earlier flights" "Load later flights"
Now, how do we simulate a click on one of these buttons?
An easy way is to call the clickElement() function on the first loadmorebutton web element:
tryCatch({
  suppressMessages({
    loadmorebutton[[1]]$clickElement()
  })
},
error = function(e) {
  loadmorebutton[[1]]$errorDetails()$message
})
This command returns an error message (if not, you’re lucky!) that is not very explicit, so if you want more details, you can call the errorDetails() function, as in our tryCatch block.
An element of the web page overlaps our button, so the browser tells us that it is not possible to click on this web element. Use the screenshot function to see the page:
remDr$screenshot(file = 'screenshoot_overlap.png' )
If we hide these elements using XPath and JavaScript injection, everything goes back to normal. First we accept cookies.
hideCookie <- function (x){
cookiesButton <- x$findElement(using = 'xpath',"//div[@class='important-banner__close']")
cookiesButton$clickElement()
}
hideCookie(remDr)
remDr$screenshot(file = 'screenshoot_hide.png')
The navbar element causes problems, so we hide it using JavaScript injection:
hideNavBar <- function (x) {
script <- "document.getElementById('navContainer').hidden = true;"
x$executeScript(script)
}
hideNavBar(remDr)
## list()
Now you can call clickElement() without problem :)
tryCatch({
  suppressMessages({
    loadmorebutton[[1]]$clickElement()
  })
},
error = function(e) {
  remDr$errorDetails()$message
})
See the changes before and after using the remDr$screenshot(display = TRUE) command.
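With a single click working, the stopping rule described earlier (keep clicking until no visible load button is left) can be sketched as a loop. This is a minimal sketch under assumptions: click_until_hidden is a hypothetical helper, the pause length and the max_clicks safety limit are arbitrary, and it expects a driver object exposing findElements(using, value) whose elements expose clickElement(), as the RSelenium objects used above do.

```r
# XPath from the tutorial: load buttons that are not hidden with display: none
visible_buttons_xpath <- paste0(
  "//button[@class='btn btn-table-action btn-flights-load' ",
  "and not(contains(@style,'display: none;'))]"
)

# Hypothetical helper: click the first visible button until none is left.
# Returns the number of clicks performed.
click_until_hidden <- function(driver, xpath, max_clicks = 50L, pause = 2) {
  clicks <- 0L
  repeat {
    # Re-run the query on every pass, because the page changes after each click
    buttons <- driver$findElements(using = "xpath", value = xpath)
    if (length(buttons) == 0L || clicks >= max_clicks) {
      break
    }
    buttons[[1]]$clickElement()
    clicks <- clicks + 1L
    Sys.sleep(pause)  # give the page time to load the new rows
  }
  clicks
}

# Usage with the live session from this tutorial (needs the Selenium server):
# n <- click_until_hidden(remDr, visible_buttons_xpath)
```

Clicking only the first match on each pass means the loop works through one button until it disappears and then moves on to the other; the max_clicks limit is there so that an unexpected page state cannot produce an endless loop.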
Sometimes, a defense is also a point of vulnerability. Many sites use some sort of internal API to query and feed their website; it’s an easy way for developers to distribute the data. But for us, this is also a perfect SPOF (Single Point Of Failure) for the data.
We try to see if this is the case with FlightRadar :)
Open the dev tools in the browser, click on the Network tab, then the XHR tab.
Lucky you, do you see it? Each GET query calls an airport.json file on the server:
https://api.flightradar24.com/common/v1/airport.json?code=bod&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=1537297562&page=1&limit=100&token=
If we decompose the query, we have :
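One way to look at the parts of this query is to split it in R. The sketch below uses only base R; parse_query is a hypothetical helper, and the reading of each parameter given in the comments is a guess from the parameter names, not documented API behavior.

```r
url <- paste0(
  "https://api.flightradar24.com/common/v1/airport.json",
  "?code=bod&plugin[]=&plugin-setting[schedule][mode]=",
  "&plugin-setting[schedule][timestamp]=1537297562",
  "&page=1&limit=100&token="
)

# Hypothetical helper: split the part after "?" into a named list of values
parse_query <- function(url) {
  query <- sub("^[^?]*\\?", "", url)            # drop everything up to "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)                # text before the first "="
  values <- sub("^[^=]*=?", "", pairs)          # text after the first "="
  setNames(as.list(values), keys)
}

params <- parse_query(url)
str(params)
# Presumed meaning (assumption):
#   code      : airport code, here "bod"
#   timestamp : Unix time in seconds around which the schedule is requested
#   page      : page number of the results
#   limit     : maximum number of flights per page
```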
Copy and paste this URL into your browser to see how the resulting JSON is structured. The interesting data is located in result > response > airport > pluginData > schedule > arrivals:
We are able to generate custom queries to download data at a custom timestamp. This query returns data structured as JSON, so we try to convert it to a data.frame using the wonderful jsonlite package :) Why wonderful? Because jsonlite has an option to flatten the JSON structure, which would otherwise give a data.frame nested in a data.frame nested in a data.frame …
# Unix timestamp in whole seconds, as in the example URL above
timestamp <- as.integer(as.numeric(now()))
url <- paste0("https://api.flightradar24.com/common/v1/airport.json?code=bod&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]=", timestamp, "&page=1&limit=100&token=")
# https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html
json <- jsonlite::fromJSON(url, flatten = TRUE)
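To query another airport, timestamp or page, the URL construction can be wrapped in a small function. This is a sketch in base R: build_airport_url is a hypothetical helper, and the assumption that code, page and limit can be varied freely is inferred from the URL, not from any API documentation.

```r
# Hypothetical helper: build the airport.json URL seen in the XHR tab
build_airport_url <- function(code,
                              timestamp = as.integer(Sys.time()),
                              page = 1L,
                              limit = 100L) {
  paste0(
    "https://api.flightradar24.com/common/v1/airport.json",
    "?code=", tolower(code),
    "&plugin[]=&plugin-setting[schedule][mode]=",
    # format() avoids scientific notation for large numeric timestamps
    "&plugin-setting[schedule][timestamp]=", format(timestamp, scientific = FALSE),
    "&page=", page,
    "&limit=", limit,
    "&token="
  )
}

# Same query as the one captured in the browser
build_airport_url("bod", timestamp = 1537297562)
```

The result can be passed to jsonlite::fromJSON(url, flatten = TRUE) exactly like the hand-written URL.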
We extract the first page of arrivals data for the airport from the JSON.
pageOfData <- json$result$response$airport$pluginData$schedule$arrivals$data
filteredData <- pageOfData %>%
  dplyr::select(flight.airline.code.icao, flight.airline.name, flight.airport.origin.name,
                flight.airport.origin.code.icao, flight.airport.origin.position.latitude,
                flight.airport.origin.position.longitude)
# plyr::rename() takes c(old_name = "new_name"); the namespace is explicit because dplyr::rename() expects new = old
filteredData <- plyr::rename(filteredData, c(flight.airline.code.icao = "ICAO", flight.airline.name = "Name", flight.airport.origin.name = "Origin", flight.airport.origin.code.icao = "Origin ICAO", flight.airport.origin.position.latitude = "Latitude", flight.airport.origin.position.longitude = "Longitude"))
knitr::kable(filteredData, caption = "page 1 of arrival for BOD")
ICAO | Name | Origin | Origin ICAO | Latitude | Longitude |
---|---|---|---|---|---|
CLG | Chalair Aviation | Nantes Atlantique Airport | LFRS | 47.15694 | -1.607770 |
AFR | Air France | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
CLG | Chalair Aviation | Brest Bretagne Airport | LFRB | 48.44722 | -4.421660 |
TUI | TUI | Casablanca Mohammed V International Airport | GMMN | 33.36746 | -7.589960 |
EZY | easyJet | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
HOP | HOP! | Lille Airport | LFQQ | 50.56333 | 3.086944 |
VOE | Volotea | Bastia Poretta Airport | LFKB | 42.55000 | 9.484722 |
VOE | Volotea | Palermo Falcone-Borsellino Airport | LICJ | 38.17595 | 13.091010 |
VOE | Volotea | Luqa Malta International Airport | LMML | 35.85749 | 14.477500 |
FPO | ASL Airlines France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
HOP | HOP! | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
CLG | Chalair Aviation | Rennes Saint-Jacques Airport | LFRN | 48.07194 | -1.732220 |
HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
VOE | Volotea | Strasbourg Airport | LFST | 48.54361 | 7.637222 |
CLG | Chalair Aviation | Nantes Atlantique Airport | LFRS | 47.15694 | -1.607770 |
EZY | EasyJet | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
IBE | Iberia | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
SWR | Swiss | Zurich Airport | LSZH | 47.46472 | 8.549167 |
BAW | British Airways | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
HOP | HOP! | Lille Airport | LFQQ | 50.56333 | 3.086944 |
HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
CLG | Chalair Aviation | Brest Bretagne Airport | LFRB | 48.44722 | -4.421660 |
CLG | Chalair Aviation | Montpellier Mediterranee Airport | LFMT | 43.58333 | 3.961389 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
EZY | EasyJet | London Gatwick Airport | EGKK | 51.14805 | -0.190270 |
KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
PBD | Pobeda | Moscow Vnukovo International Airport | UUWW | 55.59153 | 37.261478 |
VOE | Volotea | Split Airport | LDSP | 43.53894 | 16.297960 |
TSC | Air Transat | Montreal Pierre Elliott Trudeau Airport | CYUL | 45.47055 | -73.740799 |
CLG | Chalair Aviation | Nantes Atlantique Airport | LFRS | 47.15694 | -1.607770 |
RYR | Ryanair | Brussels South Charleroi Airport | EBCI | 50.46000 | 4.452778 |
VLG | Vueling | Palma de Mallorca Airport | LEPA | 39.55167 | 2.738808 |
EZY | EasyJet | Bristol Airport | EGGD | 51.38266 | -2.719080 |
THY | Turkish Airlines | Istanbul Ataturk International Airport | LTBA | 40.97692 | 28.814600 |
EZY | EasyJet | Belfast International Airport | EGAA | 54.65750 | -6.215830 |
VLG | Vueling | Malaga Costa Del Sol Airport | LEMG | 36.67490 | -4.499100 |
BEE | Flybe | Birmingham Airport | EGBB | 52.45385 | -1.748020 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
EZY | EasyJet | Milan Malpensa Airport | LIMC | 45.63060 | 8.728111 |
HOP | HOP! | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
VOE | Volotea | Santorini Thira National Airport | LGSR | 36.39916 | 25.479330 |
JAF | TUI fly Belgium | Corfu International Airport | LGKR | 39.60194 | 19.911659 |
EZY | EasyJet | Lisbon Humberto Delgado Airport | LPPT | 38.78131 | -9.135910 |
EZY | EasyJet | Palma de Mallorca Airport | LEPA | 39.55167 | 2.738808 |
VOE | Volotea | Mahon Menorca Airport | LEMH | 39.86259 | 4.218647 |
HOP | HOP! | Dusseldorf International Airport | EDDL | 51.28945 | 6.766775 |
VLG | Vueling | Barcelona El Prat Airport | LEBL | 41.29707 | 2.078463 |
RYR | Ryanair | London Stansted Airport | EGSS | 51.88500 | 0.235000 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
VOE | Volotea | Dubrovnik Airport | LDDU | 42.56135 | 18.268240 |
TAP | TAP Portugal | Lisbon Humberto Delgado Airport | LPPT | 38.78131 | -9.135910 |
EIN | Aer Lingus | Dublin Airport | EIDW | 53.42138 | -6.270000 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
EZY | EasyJet | Barcelona El Prat Airport | LEBL | 41.29707 | 2.078463 |
HOP | HOP! | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
KLM | KLM | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
EZY | EasyJet | Rhodes International Airport | LGRP | 36.40541 | 28.086189 |
CLG | Chalair Aviation | Rennes Saint-Jacques Airport | LFRN | 48.07194 | -1.732220 |
EZY | EasyJet | Basel Mulhouse-Freiburg EuroAirport | LFSB | 47.59890 | 7.528300 |
EZY | EasyJet | London Luton Airport | EGGW | 51.87472 | -0.368330 |
VOE | Volotea | Strasbourg Airport | LFST | 48.54361 | 7.637222 |
HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
EZY | EasyJet | Lille Airport | LFQQ | 50.56333 | 3.086944 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
EZY | EasyJet | Berlin Schonefeld Airport | EDDB | 52.38000 | 13.522500 |
IBE | Iberia | Madrid Barajas Airport | LEMD | 40.49355 | -3.566760 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
DLH | Lufthansa | Frankfurt Airport | EDDF | 50.02642 | 8.543125 |
BEL | Brussels Airlines | Brussels Airport | EBBR | 50.90138 | 4.484444 |
EZY | EasyJet | Geneva International Airport | LSGG | 46.23806 | 6.108950 |
CLG | Chalair Aviation | Brest Bretagne Airport | LFRB | 48.44722 | -4.421660 |
VOE | Volotea | Malaga Costa Del Sol Airport | LEMG | 36.67490 | -4.499100 |
HOP | HOP! | Lille Airport | LFQQ | 50.56333 | 3.086944 |
FPO | ASL Airlines France | Faro Airport | LPFR | 37.01442 | -7.965910 |
SWR | Swiss | Zurich Airport | LSZH | 47.46472 | 8.549167 |
CLG | Chalair Aviation | Brest Bretagne Airport | LFRB | 48.44722 | -4.421660 |
EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
CLG | Chalair Aviation | Montpellier Mediterranee Airport | LFMT | 43.58333 | 3.961389 |
HOP | HOP! | Marseille Provence Airport | LFML | 43.43666 | 5.215000 |
EZY | EasyJet | Nice Cote d’Azur Airport | LFMN | 43.66527 | 7.215000 |
AFR | Air France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
EZY | easyJet (Europcar Livery) | Amsterdam Schiphol Airport | EHAM | 52.30861 | 4.763889 |
VOE | Volotea | Toulon-Hyeres Airport | LFTH | 43.09734 | 6.146031 |
AFR | Air France | Paris Orly Airport | LFPO | 48.72333 | 2.379444 |
HOP | HOP! | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
EZY | EasyJet | Lyon Saint Exupery Airport | LFLL | 45.71964 | 5.089108 |
FPO | ASL Airlines France | Paris Charles de Gaulle Airport | LFPG | 49.01252 | 2.555752 |
CLG | Chalair Aviation | Nantes Atlantique Airport | LFRS | 47.15694 | -1.607770 |
EZY | easyJet | Brussels Airport | EBBR | 50.90138 | 4.484444 |
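The table above only covers page=1 of the query. Collecting several pages can be sketched as a loop that stops at the first empty page. This is an assumption-heavy sketch: collect_pages is a hypothetical helper, the fetcher is passed in as a function so that the loop itself does not depend on the network, and whether the API really serves further pages through the page parameter is a guess from the URL.

```r
# Hypothetical helper: call fetch_page(1), fetch_page(2), ... and stack the results.
# fetch_page must return a data.frame, or NULL / zero rows when there is no more data.
collect_pages <- function(fetch_page, max_pages = 10L) {
  pages <- list()
  for (page in seq_len(max_pages)) {
    data <- fetch_page(page)
    if (is.null(data) || nrow(data) == 0L) {
      break
    }
    pages[[length(pages) + 1L]] <- data
  }
  if (length(pages) == 0L) {
    return(data.frame())
  }
  # rbind() needs identical columns on every page
  do.call(rbind, pages)
}

# Possible fetcher for the live API (commented out, needs network access):
# fetch_arrivals <- function(page) {
#   url <- paste0("https://api.flightradar24.com/common/v1/airport.json?code=bod",
#                 "&plugin[]=&plugin-setting[schedule][mode]=",
#                 "&plugin-setting[schedule][timestamp]=", as.integer(Sys.time()),
#                 "&page=", page, "&limit=100&token=")
#   json <- jsonlite::fromJSON(url, flatten = TRUE)
#   json$result$response$airport$pluginData$schedule$arrivals$data
# }
# allArrivals <- collect_pages(fetch_arrivals)
```

If the flattened pages do not all share the same columns, dplyr::bind_rows(pages) is a more tolerant replacement for the rbind() call.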
This is the last and probably the most complex part of this big tutorial.
In a real web scraping project, there are two possible use cases:
Take a very practical example: if you need to collect one year of data on a daily basis, you cannot use your personal computer. You need to connect to a remote server (somewhere on the Internet) and run your program from there.
To keep it really short, Docker is a technology which encapsulates one or more pieces of software with all their dependencies into an isolated (and if possible immutable) container, which runs on top of any system compatible with the Docker tools (Windows/Linux/Mac). You can run multiple isolated containers (A, B, …) on the same machine, with the ability to exchange information over a common dedicated local network.
It is a very interesting technology because if you develop a program in a container on your local machine, it works on any server compatible with Docker. If you know the concept of a Virtual Machine (VM), the idea is the same, but Docker is a more efficient technology (see this comparison in the official Docker documentation).
Lena creates an independent container which contains:
Two use cases are possible:
If Lena wants to share the container MyContainer with Paul, she exports the container to a USB key (command docker export --output myContainer.tar myContainer). Later, Paul copies the archive and imports it on his machine as an image (command docker import myContainer.tar).
If Lena wants to run the MyContainer container on some server on the web that has Docker installed, she exports the container (command docker export --output myContainer.tar myContainer), copies the archive to the server (using the FTP protocol for example) and imports it on the server (command docker import myContainer.tar).
Note that docker load, used elsewhere in this tutorial, reads image archives created with docker save, not container archives created with docker export.
Be careful 1: if your container uses a volume, you need a special procedure to migrate it.
Be careful 2: if your container needs another container to work, you need to export both.
Here we are: we use Docker to encapsulate our web scraping script into one portable container. After that you can save yours and launch it on any web server which runs Docker.
There are three big steps to understand in the container lifecycle:
First, we describe the composition of an image in a Dockerfile, using a special Docker syntax. It’s like a recipe in a cookbook. For example, you can find lots of recipes for common software on this site: DockerHub.
Next, like a recipe in real life, you need to turn this recipe into some delicious cake. The image needs to be built before use. From an image/recipe of a cake we create a container/cake.
Finally, you run the built image.
You can find lots of other information on the web, and also in this online tutorial, which summarizes lots of commands.
In this tutorial, we give you the corresponding Dockerfile, which contains the recipe to build a container ready to scrape the FlightRadar website. All the scripts needed for that are in the docker-scripts folder.
At the end of the tutorial you have this architecture, with two containers (RSelenium and Flightscrapradar) which communicate to scrape data from the FlightRadar website.
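When the scraping script runs inside its own container, localhost no longer points at the Selenium server, and the published host port 4445 is replaced by the container port 4444 (the right-hand side of -p 4445:4444). One way to keep a single script for both situations is to read the address from environment variables. This is a sketch under assumptions: SELENIUM_HOST and SELENIUM_PORT are hypothetical variable names, not something defined by the tutorial's Dockerfile, and selenium_endpoint is a hypothetical helper.

```r
# Hypothetical helper: where to reach the Selenium server.
# Defaults match the local setup of this tutorial (localhost, published port 4445).
selenium_endpoint <- function() {
  list(
    host = Sys.getenv("SELENIUM_HOST", unset = "localhost"),
    port = as.integer(Sys.getenv("SELENIUM_PORT", unset = "4445"))
  )
}

endpoint <- selenium_endpoint()
# With RSelenium loaded, the connection would then be opened as before:
# remDr <- remoteDriver(remoteServerAddr = endpoint$host, port = endpoint$port)
# remDr$open()
```

Inside the container, the two variables could be set with -e SELENIUM_HOST=selenium -e SELENIUM_PORT=4444 on the docker run command line, provided both containers share a Docker network on which the name selenium resolves.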
On Ubuntu Linux, you can find the documentation here. First step, install the packages required to add the key and repository.
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
software-properties-common
Add key and repository :
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
"deb [arch=amd64] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) \
stable"
sudo apt-get update
Install docker-ce :
sudo apt-get install docker-ce
You are ready to jump to the tutorial which corresponds to your OS.
There are two ways to install Docker for Windows and Mac: a new way (Windows / Mac) and an old way. For this tutorial we use the old way because of its better compatibility.
Install Docker Toolbox using the DockerToolbox.exe file (or the .dmg file for Mac). The official documentation is available here: Windows and Mac.
After that you can launch the Docker Quickstart Terminal directly after installation or using the icon in the Start menu.
Docker first downloads an ISO, and then tests whether your system is ready to run containers. If you see an error like this, you need to go through another step.
Restart your computer and try to activate an option in the BIOS (Del key during the initialization of your computer), probably named “Vanderpool technology”, “VT-X technology” or “Virtualization technology”. Save and restart. Some pictures of the UEFI BIOS on HP, DELL and ASUS motherboards/systems:
Asus
Dell
HP
You are ready to jump to the tutorial which corresponds to your OS.
Copy the docker-images folder from the USB key (ask the teachers) into the scrap-flightradar folder of this tutorial.
Now, go to this folder using the terminal command cd pathofthefolder, and load the two images on your system.
sudo docker load --input=r-alpine.tar
sudo docker load --input=rSelenium.tar
BUILD image
Go to the docker-scripts folder inside the folder which contains this tutorial on your disk.
Building this image takes a lot of time (ten minutes), because of the huge dplyr library. Run the docker build command in the folder which contains the Dockerfile describing the image.
docker build . --tag=rflightscraps
LAUNCH Container
Create a folder named localbackup and run the rflightscraps container with the correct path.
mkdir localbackup
docker run --name rflightscraps -d -e UID=1000 -e GID=1000 --mount type=bind,source=$(pwd)/localbackup,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps
To check that your container is running and to consult the execution logs:
sudo docker ps
sudo docker logs rflightscraps
To consult the results of the automatic harvesting, list the docker-scripts/localbackup folder using the ls Unix command. You see a list of CSV files which correspond to the harvests made every minute. If you want to change this, you need to modify the crontab file following the cron syntax, and rebuild/relaunch the image (it takes less time, because you only modify one file, so there is no need to recompile).
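On the R side, producing one CSV per harvest comes down to writing each result under a time-stamped name. The sketch below shows one possible way to do it in base R; save_harvest and the arrivals_ file name pattern are hypothetical and are not taken from the scripts in docker-scripts.

```r
# Hypothetical helper: write one harvest to <dir>/arrivals_YYYYmmdd_HHMMSS.csv
# and return the path of the file that was written.
save_harvest <- function(data, dir = "data", time = Sys.time()) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  stamp <- format(time, "%Y%m%d_%H%M%S")
  path <- file.path(dir, paste0("arrivals_", stamp, ".csv"))
  write.csv(data, path, row.names = FALSE)
  path
}

# Usage at the end of a scraping run, where filteredData is the
# data.frame built earlier in this tutorial:
# save_harvest(filteredData, dir = "data")
```

Because the name contains the time down to the second, a job triggered every minute by cron never overwrites a previous harvest.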
Create a named volume, independent from filesystem
docker volume create --name myDataVolume
docker volume ls
Mount the volume :
docker run --mount type=volume,source=myDataVolume,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps
Export data:
- Using an alpine image, we mount the named volume (myDataVolume) to a /alpine_data folder inside the alpine container.
- We create a second folder inside the alpine container, named /alpine_backup.
- We create a tar archive of the /alpine_data folder and we store it inside the /alpine_backup folder (inside the container).
- We export the /alpine_backup folder from the container to the Docker host (your local machine) in a folder named /local_backup inside the current directory.
docker run --rm -v myDataVolume:/alpine_data -v $(pwd)/local_backup:/alpine_backup alpine:latest tar cvf /alpine_backup/scrap_data_"$(date '+%y-%m-%d')".tar /alpine_data
EXPORT Containers
If you don’t have a server yet, you can rent one with Docker already installed for a low monthly price:
- https://www.digitalocean.com/products/one-click-apps/docker/
- https://www.ovh.com/fr/vps/vps-cloud.xml
PREPARE image
Copy the docker-scripts and docker-images folders into c:\Program Files\Docker Toolbox
After that, you can see these folders in the Docker Toolbox terminal.
Go to the docker-images folder using the cd command, and load the two images:
docker load --input=r-alpine.tar
docker load --input=rSelenium.tar
BUILD image
Go to the docker-scripts folder inside the folder which contains this tutorial on your disk.
Building this image takes a lot of time (ten minutes), because of the huge dplyr library. Run the docker build command in the folder which contains the Dockerfile describing the image.
docker build . --tag=rflightscraps
LAUNCH container
We use a bind-mounted volume, which is currently the easiest way.
First, create a new folder named localbackup in your user folder on Windows: C:\Users\yourname
After that, replace the path with yours in this command and run it.
docker run --name rflightscraps -d -e UID=1000 -e GID=1000 --mount type=bind,source=/c/Users/reyse/localbackup,destination=/usr/local/src/flight-scrap/docker-scripts/data rflightscraps
The end, close the session!
remDr$close()