Unfortunately, some of it is hard to access programmatically. Wikipedia defines web scraping as follows: "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites."

Here are three approaches (i.e. Python libraries) for web scraping which are among the most popular:

Sending an HTTP request, ordinarily via Requests, to a webpage and then parsing the returned HTML, ordinarily using BeautifulSoup, to access the desired information.

Typical Use Case: Standard web scraping problem, refer to the case study.

Scrapy, which can be thought of as more of a general web scraping framework that can be used to build spiders and scrape data from various websites whilst minimizing repetition.

While you could scrape data using any other programming language as well, Python is commonly used due to its ease of syntax as well as the large variety of libraries available for scraping purposes.
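As a rough illustration of the first approach above, the following minimal sketch fetches a page with Requests and parses the returned HTML with BeautifulSoup. The helper names `fetch_html` and `extract_links` and the sample HTML are assumptions made for this example and are not part of the case study; the parsing step runs on an inline HTML string so that the sketch works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Send an HTTP GET request and return the response body as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list:
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (anchor.get_text(strip=True), anchor["href"])
        for anchor in soup.find_all("a", href=True)
    ]


# Hypothetical sample page, used here instead of a live request.
SAMPLE_HTML = """
<html><body>
  <a href="/first">First</a>
  <a>No link here</a>
  <a href="/second"> Second </a>
</body></html>
"""

print(extract_links(SAMPLE_HTML))
```

To scrape a live page, one would pass the result of `fetch_html(url)` to `extract_links` instead of the sample string, after checking that the site permits it.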

After this short intro, this post will move on to some web scraping ethics, followed by some general information on the libraries which will be used in this post. Lastly, everything we have learned so far will be applied to a case study in which, after checking their website and their robots.txt, we will acquire the data of all companies in the portfolio of Sequoia Capital, one of the most well-known VC firms in the US. In the scope of this blog post, we will only be able to have a look at one of the three methods above.

Note that the tools above are not mutually exclusive; you might, for example, get some HTML text with Scrapy or Selenium and then parse it with BeautifulSoup.

Web Scraping Ethics

One factor that is extremely relevant when conducting web scraping is ethics and legality.

Whether scraping is acceptable tends to depend on the specific data you are scraping. For that, the following section will come in handy.

Understanding robots.txt

The following shows an example robots.txt, that of Hackernews. Because only certain URLs are disallowed, this implicitly allows everything else.
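To make the "disallow a few URLs, implicitly allow the rest" behaviour concrete, here is a small sketch using `urllib.robotparser` from the Python standard library. The robots.txt content, the paths, and the user agent name `my-bot` are hypothetical and chosen only for illustration; they are not the actual Hackernews file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (hypothetical paths, not a real site's file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no request is made.
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A URL under a disallowed prefix is off limits for every bot.
blocked = parser.can_fetch("my-bot", "https://example.com/private/report.html")
# Any URL that matches no Disallow rule is implicitly allowed.
allowed = parser.can_fetch("my-bot", "https://example.com/news")

print(blocked)  # False
print(allowed)  # True
```

For a real site, `set_url()` followed by `read()` would download the live robots.txt instead of parsing an inline string.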

An alternative would be to exclude everything and then explicitly specify only certain URLs which can be accessed by crawlers or other bots. Also, notice the crawl delay of 30 seconds, which means that each bot should send only one request every 30 seconds. It is good practice, in general, to let your crawler or scraper sleep at regular, fairly long intervals, since too many requests can bring sites down, even when they come from human users.
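One possible way to respect such a crawl delay is sketched below: the delay is read from the robots.txt with `urllib.robotparser`, and the scraper sleeps between consecutive requests. The function names `crawl_delay_for` and `fetch_politely`, the fallback delay of one second, and the inline robots.txt are assumptions made for this example. The `fetch` and `sleep` arguments are injectable so that the logic can be tried out without network access or real waiting.

```python
import time
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt with a 30-second crawl delay (hypothetical content).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
"""


def crawl_delay_for(robots_txt, user_agent, default=1.0):
    """Return the Crawl-delay for user_agent, or a fallback if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def fetch_politely(urls, fetch, delay_seconds, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


delay = crawl_delay_for(ROBOTS_TXT, "my-bot")
print(delay)  # 30.0
```

In a real scraper, `fetch` would be a function that performs the HTTP request, and `sleep` would be left at its default of `time.sleep`.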

When looking at the robots. Anything else e. This makes sense when you consider the mission of Hackernews, which is mostly to disseminate information. Refer to the Gist below for the robots.txt of Google. Check it out for yourself, since it is much longer than shown below, but essentially, no bots are allowed to perform a search on Google, as specified on the first two lines.

Moving on, we will take a look at the specific Python packages which will be used in the scope of this case study, namely Requests and BeautifulSoup.

