Understanding the Basics

September 29, 2021

Before moving on to a scraping project, you must understand the basics of web scraping. The basics are the terms and tools that we use during scraping. Let's get started.

What do you mean by "headless browser"?

A headless browser is a web browser with no graphical user interface (GUI). Instead of responding to clicks and keystrokes, it follows instructions that software developers write in a programming language. Headless browsers are mostly used for running automated quality assurance tests and for scraping websites.

Is it legal to scrape a website?

Websites often allow other software to scrape their content, but it depends on the website. Please refer to the robots exclusion standard (robots.txt file) of the website that you intend to scrape, as it usually describes which pages automated clients are asked to stay away from. The robots.txt file states the site's preferences and is not a legal document in itself, so you should also check the terms of service to see if you are allowed to scrape.
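
As a minimal sketch of the robots.txt check described above, Python's standard library includes urllib.robotparser. The robots.txt content, the "my-scraper" user agent, and the example.com URLs below are made up for illustration; for a real site you would load its actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch() questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name; the URLs are placeholders.
print(parser.can_fetch("my-scraper", "https://example.com/articles/1"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
```

With these made-up rules, the first URL is allowed and the second is not. This only tells you what the robots.txt file asks for; the terms of service still need to be read separately.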

What is a headless environment?

Headless means that the given device or software has no user interface or input mechanism such as a keyboard or mouse. The term "headless environment" usually describes a system, such as a server, that runs without a display and provides services to other computers or programs.

What is headless Chrome?

Headless Chrome is essentially the Google Chrome web browser running without its graphical user interface (GUI), built on the same underlying technology. Instead of being operated by a person, headless Chrome is controlled by scripts written by software developers.
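
To make "controlled by scripts" concrete, here is a rough sketch that starts Chrome in headless mode from Python and reads back the rendered HTML. It assumes a Chrome binary is installed and reachable under the name google-chrome, which varies by operating system; the helper names are hypothetical and the URL is a placeholder.

```python
import shutil
import subprocess
from typing import List, Optional


def headless_dump_dom_command(url: str, chrome_binary: str = "google-chrome") -> List[str]:
    """Build the command line that asks headless Chrome to print the page's DOM."""
    # --headless runs Chrome without a window; --dump-dom prints the rendered HTML.
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]


def dump_dom(url: str, chrome_binary: str = "google-chrome", timeout: int = 30) -> Optional[str]:
    """Run headless Chrome and return the rendered HTML, or None if Chrome is not found."""
    if shutil.which(chrome_binary) is None:
        return None  # assumed binary name is not on PATH
    result = subprocess.run(
        headless_dump_dom_command(url, chrome_binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Only print the command here; call dump_dom(...) yourself to actually launch Chrome.
print(" ".join(headless_dump_dom_command("https://example.com")))
```

Calling dump_dom("https://example.com") would launch the browser, load the page, and return the HTML after JavaScript has run, which is the main reason scrapers reach for a headless browser instead of a plain HTTP request.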

What is Google Puppeteer?

Puppeteer is a Node.js library maintained by the Chrome development team at Google. Puppeteer provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol.

Is Selenium a framework?

Yes, but not a front-end web framework like Angular or React; Selenium is a software testing framework for web applications. Its primary use case is automating quality assurance tests in web browsers, including headless ones, but it's often used to automate administration tasks on websites too.