class: center, middle, inverse, title-slide

# Programming Tools in Data Science
## Lecture #7: webscraping with R
### Samuel Orso
### 18 October 2022

---

# Webscraping with R

```r
library(rvest)
url <- "https://ptds.samorso.ch/lectures/"
read_html(url) %>%
  html_table() %>%
  .[[1]] %>%
  .[5:7,] %>%
  kableExtra::kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Week </th>
   <th style="text-align:left;"> Date </th>
   <th style="text-align:left;"> Topic </th>
   <th style="text-align:left;"> Instructor </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:left;"> 18 Oct </td>
   <td style="text-align:left;"> Function I, project proposal, webscraping </td>
   <td style="text-align:left;"> Samuel </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:left;"> 25 Oct </td>
   <td style="text-align:left;"> Exercise and Homework 3 </td>
   <td style="text-align:left;"> Aleksandr </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:left;"> 1 Nov </td>
   <td style="text-align:left;"> Group project </td>
   <td style="text-align:left;"> Aleksandr </td>
  </tr>
</tbody>
</table>

---

<center>
<div style="width:600px"><iframe allow="fullscreen" frameBorder="0" height="375" src="https://giphy.com/embed/LNkZr3BhUhQvo92eRO/video" width="600"></iframe></div>
</center>

---

# Setup

* For this class, you will need (at least) the following packages:

```r
install.packages(c("rvest","magrittr"))
```

* You also need a web browser (Chrome, Firefox, ...)

---

# API

* **A**pplication **P**rogramming **I**nterfaces (APIs) are the gold standard for fetching data from the web.
* Data is fetched by sending HTTP requests directly to the provider.
* Requests are made from `R` using `library(httr)` or API wrappers.

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Provider </th>
   <th style="text-align:left;"> Registration </th>
   <th style="text-align:left;"> Wrapper </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Twitter </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Financial Times </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Open Weather Map </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
  <tr>
   <td style="text-align:left;"> DeepL </td>
   <td style="text-align:left;"> TRUE </td>
   <td style="text-align:left;"> TRUE </td>
  </tr>
</tbody>
</table>

---

# API example

```r
library(pageviews)
top_articles("en.wikipedia", start = (Sys.Date()-1)) %>%
  dplyr::select(article, views) %>%
  dplyr::top_n(10)
```

```
## Selecting by views
```

```
##                         article   views
## 1                     Main_Page 4934355
## 2                Special:Search 1319645
## 3                Jeffrey_Dahmer  494099
## 4           House_of_the_Dragon  378321
## 5              2022_Ballon_d'Or  252229
## 6                Kantara_(film)  234348
## 7                   Ballon_d'Or  205749
## 8                     Cleopatra  184599
## 9  The_Watcher_(2022_TV_series)  180767
## 10               Halloween_Ends  171917
```

---

# API example

```r
library(deeplr)
# my_key holds your personal DeepL API authentication key
deeplr::translate2(
  text = "Mais quelle bonne traduction nom d'une pipe!",
  target_lang = "EN",
  auth_key = my_key
  )
```

```
## [1] "But what a good translation, by golly!"
```

This is what I obtain with Google Translate:

> But what a good translation of the name of a pipe!
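---

# API example

When no wrapper is available, a request can be written by hand with `httr`. The sketch below is only illustrative: the endpoint `https://api.example.com`, the path `v1/weather`, the query parameters, the user agent string and the helper `fetch_json()` are all assumptions, not a real service.

```r
library(httr)

# Hypothetical endpoint and parameters, for illustration only.
base_url <- "https://api.example.com"
request_url <- modify_url(
  base_url,
  path  = "v1/weather",
  query = list(q = "Lausanne", units = "metric")
)

# Hypothetical helper: send a GET request and parse the JSON body.
fetch_json <- function(url) {
  response <- GET(url, user_agent("ptds-lecture-demo"))
  stop_for_status(response)  # turn HTTP errors (4xx/5xx) into R errors
  content(response, as = "parsed", type = "application/json")
}

# Not run: the endpoint above does not exist.
# weather <- fetch_json(request_url)
```

Building the URL with `modify_url()` lets `httr` assemble the query string, and `stop_for_status()` makes a failed request stop the script instead of silently returning an error page.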
---

# HTTP request/response cycle

<img src="images/http_request_response.png" width="1680" />

---

# HyperText Markup Language

```html
<!DOCTYPE html>
<html>
<body>
  <h1 id='first'>Webscraping with R</h1>
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
</body>
</html>
```

.bottom[[Try it!](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_default)]

---

# HTML

* an **element** starts with `<tag>` and ends with `</tag>`,
* it has optional **attributes** (for example `id='first'`),
* **content** is everything between the two tags.

.center[
Add the attribute `style="background-color:DodgerBlue;"` to `h1` and try it.
]

---

# HTML elements

tag | meaning
--- | ---
p | Paragraph
h1 | Top-level heading
h2, h3, ... | Lower level headings
ol | Ordered list
ul | Unordered list
li | List item
img | Image
a | Anchor (Hyperlink)
div | Section wrapper (block-level)
span | Text wrapper (in-line)

Find out more [tags](https://developer.mozilla.org/en-US/docs/Web/HTML)

---

# CSS

```html
<!DOCTYPE html>
<html>
<head>
<style>
body {
  background-color: lightblue;
}
h1 {
  color: white;
  text-align: center;
}
.content {
  font-family: monospace;
  font-size: 1.5em;
  color: black;
}
#intro {
  background-color: lightgrey;
  border-style: solid;
  border-width: 5px;
  padding: 5px;
  margin: 5px;
  text-align: center;
}
</style>
</head>
<body>
...
```

---

# Data extraction

Create an HTML page with `minimal_html()` to experiment with:

```r
html_page <- minimal_html('
<body>
  <h1>Webscraping with R</h1>
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
</body>')
```

---

# HTML elements

```html
...
  <h2>Technologies</h2>
  <ol>
*   <li>HTML: <em>Hypertext Markup Language</em></li>
*   <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
*   <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
...
```

```r
html_page %>% html_nodes("li")
```

```
## {xml_nodeset (3)}
## [1] <li>HTML: <em>Hypertext Markup Language</em>\n</li>
## [2] <li>CSS: <em>Cascading Style Sheets</em>\n</li>
## [3] <li>rvest</li>
```

```r
html_page %>% html_nodes("li") %>% html_text()
```

```
## [1] "HTML: Hypertext Markup Language" "CSS: Cascading Style Sheets"
## [3] "rvest"
```

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
* familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
*   <li>HTML: <em>Hypertext Markup Language</em></li>
*   <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
* <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("em") %>% html_text()
```

```
## [1] "Tidyverse"                 "Hypertext Markup Language"
## [3] "Cascading Style Sheets"    "rvest"
## [5] "tidyverse"
```

---

# CSS selector

selector | meaning
--- | ---
`,` | grouping
space | descendant
`>` | child
`+` | adjacent sibling
`~` | general sibling
`:first-child` | first element
`:nth-child(n)` | n-th element
`:last-child` | last element
`.` | class selector
`#` | id selector

.center[
[CSS diner](https://flukeout.github.io/)
[CSS selector](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors)
]

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("li, em") %>% html_text()
```

```
## [1] "Tidyverse"                       "HTML: Hypertext Markup Language"
## [3] "Hypertext Markup Language"       "CSS: Cascading Style Sheets"
## [5] "Cascading Style Sheets"          "rvest"
## [7] "rvest"                           "tidyverse"
```

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("li em") %>% html_text()
```

```
## [1] "Hypertext Markup Language" "Cascading Style Sheets"
```

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("p > em") %>% html_text()
```

```
## [1] "Tidyverse" "rvest"     "tidyverse"
```

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("p + em") %>% html_text()
```

```
## character(0)
```

```r
html_page %>% html_nodes("em + em") %>% html_text()
```

```
## [1] "tidyverse"
```

---

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("li:first-child") %>% html_text()
```

```
## [1] "HTML: Hypertext Markup Language" "rvest"
```

```r
html_page %>% html_nodes("li:nth-child(2)") %>% html_text()
```

```
## [1] "CSS: Cascading Style Sheets"
```

```r
html_page %>% html_nodes("ol> li:last-child") %>% html_text()
```

```
## [1] "CSS: Cascading Style Sheets"
```

---

# HTML attributes

```html
  <p> Basic experience with <a href="www.r-project.org">R</a> and
  familiarity with the <em>Tidyverse</em> is recommended.</p>
  <h2>Technologies</h2>
  <ol>
    <li>HTML: <em>Hypertext Markup Language</em></li>
    <li>CSS: <em>Cascading Style Sheets</em></li>
  </ol>
  <h2>Packages</h2>
  <ul>
    <a href="https://github.com/tidyverse/rvest"><li>rvest</li></a>
  </ul>
  <p><strong>Note</strong>:
  <em>rvest</em> is included in the <em>tidyverse</em></p>
```

```r
html_page %>% html_nodes("a") %>% html_attr("href")
```

```
## [1] "www.r-project.org"                  "https://github.com/tidyverse/rvest"
```

```r
html_page %>% html_nodes("ul a") %>% html_attr("href")
```

```
## [1] "https://github.com/tidyverse/rvest"
```

---

# HTML tables

tag | meaning
--- | ---
table | Table section
tr | Table row
td | Table cell
th | Table header

---

```r
basic_table <- minimal_html('
<body>
  <table>
    <tr>
      <th>Month</th>
      <th>Savings</th>
    </tr>
    <tr>
      <td>January</td>
      <td>$100</td>
    </tr>
    <tr>
      <td>February</td>
      <td>$80</td>
    </tr>
  </table>
</body>
')
```

```r
basic_table %>% html_table()
```

```
## [[1]]
## # A tibble: 2 × 2
##   Month    Savings
##   <chr>    <chr>
## 1 January  $100
## 2 February $80
```

---

# Cheat sheet

<img src="images/functions_and_classes.png" width="810" height="450" style="display: block; margin: auto;" />

.center[<https://github.com/yusuzech/r-web-scraping-cheat-sheet/>]

---

# Why could web scraping be bad?

* Scraping increases traffic on the website being scraped.
* People ignore and violate the `robots.txt` and Terms of Service (ToS) of websites.
* Avoid trouble by following these simple rules:
  1. Read the ToS of the website you want to scrape.
  2. Inspect `robots.txt` (see <https://cran.r-project.org/robots.txt> for instance).
  3. Use a reasonable frequency of requests.

---
class: sydney-blue, center, middle

# Questions?

.pull-down[

<a href="https://ptds.samorso.ch/">
.white[<svg viewBox="0 0 384 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M369.9 97.9L286 14C277 5 264.8-.1 252.1-.1H48C21.5 0 0 21.5 0 48v416c0 26.5 21.5 48 48 48h288c26.5 0 48-21.5 48-48V131.9c0-12.7-5.1-25-14.1-34zM332.1 128H256V51.9l76.1 76.1zM48 464V48h160v104c0 13.3 10.7 24 24 24h104v288H48z"></path></svg> website]
</a>

<a href="https://github.com/ptds2021/">
.white[<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> GitHub]
</a>

]

---

# In-class exercises (10 min)

* Play with [CSS Diner](https://flukeout.github.io/)
* Try this [workflow](https://smac-group.github.io/ds/section-web-scraping.html#section-workflow)

---

# To go further

* More details and examples in the book [An Introduction to Statistical Programming Methods with R](https://smac-group.github.io/ds/section-web-scraping.html)
* <https://github.com/yusuzech/r-web-scraping-cheat-sheet/>
* Want to build your own R API wrapper? Have a look at <https://colinfay.me/build-api-wrapper-package-r/> and <https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html>
* [Datacamp](https://www.datacamp.com/courses/web-scraping-in-r) class on webscraping with R
* [Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining](https://www.wiley.com/en-us/Automated+Data+Collection+with+R%3A+A+Practical+Guide+to+Web+Scraping+and+Text+Mining-p-9781118834817)
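
---

# Scraping responsibly

The rules about `robots.txt` and request frequency can be sketched in a few lines of base `R`. This is a deliberately simplified illustration: the `robots.txt` content, the paths and the helpers `robots_field()` and `is_allowed()` are hypothetical, and the check ignores per-agent groups, `Allow` rules and wildcards.

```r
# Hypothetical robots.txt content; in practice it could be read with
# readLines() from the robots.txt file of the website of interest.
robots_txt <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Disallow: /tmp/",
  "Crawl-delay: 1"
)

# Extract the value(s) of a field such as "Disallow" or "Crawl-delay".
robots_field <- function(lines, field) {
  pattern <- paste0("^", field, ":\\s*")
  hits <- grep(pattern, lines, ignore.case = TRUE, value = TRUE)
  sub(pattern, "", hits, ignore.case = TRUE)
}

# Simplified check: a path is allowed if it starts with no Disallow prefix.
is_allowed <- function(path, lines) {
  disallowed <- robots_field(lines, "Disallow")
  disallowed <- disallowed[nzchar(disallowed)]
  !any(startsWith(path, disallowed))
}

# Seconds to wait between requests; fall back to 1 if none is given.
delay <- as.numeric(robots_field(robots_txt, "Crawl-delay"))
if (length(delay) == 0) delay <- 1

for (path in c("/lectures/", "/private/data.html")) {
  if (is_allowed(path, robots_txt)) {
    message("OK to request ", path)
    # the actual read_html() call would go here
    Sys.sleep(delay)  # pause before the next request
  } else {
    message("Skipping ", path, " (disallowed)")
  }
}
```

For real projects, a dedicated parser such as the `robotstxt` package is likely a safer choice than this hand-written check.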