In my last article, I discussed generating a dataset using an Application Programming Interface (API) and Python libraries. APIs let us draw very useful information from a website in an easy manner. However, not all websites have APIs, which makes it difficult to gather relevant data. In such cases, we can use web scraping to access a website’s content and create our dataset.
Web Scraping is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format. — WebHarvy
Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can limit ourselves to collecting a large amount of information from a single source and using it as a dataset. In this particular example, we’ll explore Wikipedia. I’ll also explain the HTML basics we’ll need. The complete project is available as a Notebook in the GitHub repository Web Scraping using Python.
This example is just for demonstration purposes. We must always follow a website’s guidelines before scraping it, and especially before accessing its data for any commercial purpose.
This is a two-part article. In this first part, we’ll explore how to get the data from the website using BeautifulSoup, and in the second part, we’ll clean the collected dataset.
Determine the content
“man drawing on dry-erase board” by Kaleidico on Unsplash
We’ll access the List of countries and dependencies by population Wikipedia webpage. The webpage includes a table with the names of countries, their population, the date of data collection, the percentage of world population, and the source. And if we go to any country’s page, all the information about it is presented in a standard box on the right. This box includes a lot of information, such as total area, water percentage, GDP, etc.
Here, we will combine the data from these two webpages into one dataset.
List of Countries: On accessing the first page, we’ll extract the list of countries, their population and percentage of world population.
Country: We’ll then access each country’s page, and get information including total area, percentage water, and GDP (nominal).
Thus, our final dataset will include information about each country.
HTML Basics
Each webpage that you view in your browser is actually structured in HyperText Markup Language (HTML). It has two parts: the head, which includes the title and any imports for styling and JavaScript, and the body, which includes the content that gets displayed as the webpage. We’re interested in the body of the webpage.
HTML is composed of tags. An opening tag consists of the tag’s name enclosed between an opening < and a closing > angle bracket; the matching closing tag adds a forward slash / after the opening angle bracket. For example, <div></div>, <p>Some text</p>, etc.
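The article originally embedded a small Homepage.html file here. A minimal reconstruction, consistent with the description that follows (the exact markup is an assumption, but it uses the id base and the class data mentioned below), might look like:

```html
<!-- Hypothetical Homepage.html: the id is unique, the class is shared -->
<html>
  <head>
    <title>Homepage</title>
  </head>
  <body>
    <div id="base">
      <table>
        <tr>
          <td class="data">Cell 1</td>
          <td class="data">Cell 2</td>
        </tr>
      </table>
    </div>
  </body>
</html>
```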
Homepage.html as an example
There are two direct ways to access any element (tag) present on the webpage. We can use an id, which is unique, or we can use a class, which can be associated with multiple elements. Here, we can see that the <div> has the attribute id set to base, which acts as a reference to this element, while all the table cells marked by td share the same class, data.
Generally useful tags include:
<div>: A container that groups related content into a single entity. It can act as the parent for many different elements, so if style changes are applied to it, they’ll also reflect in its child elements.
<a>: Describes a link; the webpage that gets loaded on clicking the link is specified in its href attribute.
<p>: Used whenever some information is to be displayed as a block of text. Each such tag appears as its own paragraph.
<span>: Used when information is to be displayed inline. When two such tags are placed side by side, they’ll appear on the same line, unlike paragraph tags.
<table>: Displays tables in HTML, where data is shown in cells formed by the intersection of rows and columns.
Import Libraries
We first begin by importing necessary libraries, namely, numpy, pandas, urllib and BeautifulSoup.
numpy: A very popular library that makes array operations simple and fast.
pandas: Lets us organize the data in a tabular structure, so we can manipulate it with its many efficiently developed functions.
urllib: We use this library to open the URL from which we want to extract the data.
BeautifulSoup: Helps us get the HTML structure of the page we want to work with. We can then use its functions to access specific elements and extract relevant information.
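The import cell can be sketched as follows (the aliases np and pd are the usual conventions, not something the article specifies):

```python
# Libraries used throughout this article
import numpy as np                      # fast array operations
import pandas as pd                     # tabular data manipulation
from urllib.request import urlopen      # open URLs
from bs4 import BeautifulSoup           # parse and navigate HTML
```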
Import all libraries
Understand the data
Initially, we define just a basic function that reads a URL and extracts the HTML from it. We’ll introduce new functions as and when they are needed.
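A minimal sketch of this function (the name getHTMLContent() comes from the article; error handling is omitted):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getHTMLContent(link):
    # Open the URL and parse the response with Python's built-in HTML parser
    html = urlopen(link)
    soup = BeautifulSoup(html, 'html.parser')
    return soup
```

Calling it with the Wikipedia page’s URL returns the parsed HTML of the whole page, ready for navigation.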
Function to get HTML of a webpage
The getHTMLContent() function takes the URL as a parameter. We first open the URL using the urlopen method. This lets us apply the BeautifulSoup library to get the HTML using a parser. While many parsers are available, in this example we use html.parser, which parses HTML files. We then simply return the output, which we can use to extract our data.
We use this function to get the HTML content for the Wikipedia page of List of countries. We see that the countries are present in a table. So, we use the find_all() method to find all tables on the page. The parameter that we supply inside this function determines the element that it returns. As we require tables, we pass the argument as table and then iterate over all tables to identify the one we need.
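The table lookup can be sketched like this. To keep the example self-contained and runnable, a tiny inline document (an assumption, not the real Wikipedia markup) stands in for the page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia page, with two tables;
# only the second carries the class we are after.
sample_html = """
<table class="toc"><tr><td>Contents</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Country</th></tr>
  <tr><td><a href="/wiki/India">India</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns every matching element ...
tables = soup.find_all('table')
for table in tables:
    print(table.prettify())

# ... while find() returns the first match, optionally filtered by class
data_table = soup.find('table', {'class': 'wikitable sortable'})
```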
We print each table with the prettify() function, which makes the output more readable. Now we need to analyse the output and see which table has the data we are searching for. After much inspection, we can see that the table with the class wikitable sortable has the data we need. Thus, our next step is to access this table and its data. For this, we will use the find() function, which allows us to specify not only the element we are looking for but also its properties, such as the class name.
Print all country links
A table in HTML is composed of rows denoted by the tags <tr></tr>. Each row has cells, which can either be headings defined using <th></th> or data defined using <td></td>. Thus, to access each country’s webpage, we can get its link from the cells in the Country column of the table (the second column). So, we iterate over all the rows in the table and read the second column’s data into the variable country_link. For each row, we extract the cells and get the a element in the second column (indexing in Python starts at 0, so the second column corresponds to cell[1]). Finally, we print all the links.
The links do not include the base address, so whenever we access any of these links, we’ll append https://en.wikipedia.org as the prefix.
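These two steps can be sketched together. Again, a small inline row (an assumption standing in for the real table) keeps the example runnable:

```python
from bs4 import BeautifulSoup

# Stand-in for one row of the countries table: the link we want
# sits in the second cell (index 1, since Python counts from 0).
row_html = """
<table class="wikitable sortable">
  <tr><td>1</td><td><a href="/wiki/India">India</a></td><td>1,400,000,000</td></tr>
</table>
"""
table = BeautifulSoup(row_html, 'html.parser').find('table')

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 1:               # skip rows without data cells
        country_link = cells[1].find('a')
        # The href is relative, so prepend the Wikipedia base address
        print('https://en.wikipedia.org' + country_link.get('href'))
```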
While the function I developed to extract the data from each country’s webpage might appear small, it went through many iterations before I finalised it. Let’s explore it step by step.
Each country’s page includes an information box on the right, which includes the Motto, Name, GDP, Area and other important features. So, we first identified this box by the same steps as before: it is a table with the class infobox geography vcard. Next, we define the variable additional_details to collect all the information we get from this page in a list, which we can then append to the list-of-countries dataset.
When we enter the Chrome browser’s inspect mode (right-click anywhere and select Inspect) on the country page, we can look at the classes for each heading in the table. We are interested in four fields: Total area, Water (%), GDP (nominal) total, and GDP (nominal) per capita.
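The general shape of the infobox extraction can be sketched as follows. The real function filters for the four specific fields above; this simplified version just collects every value cell, and the inline infobox markup is an assumption standing in for a real country page:

```python
from bs4 import BeautifulSoup

# Stand-in for the infobox on a country page; the class name
# 'infobox geography vcard' is the one found by inspecting Wikipedia.
infobox_html = """
<table class="infobox geography vcard">
  <tr><th>Total</th><td>3,287,263 km2</td></tr>
  <tr><th>Water (%)</th><td>9.6</td></tr>
</table>
"""
soup = BeautifulSoup(infobox_html, 'html.parser')
fact_table = soup.find('table', {'class': 'infobox geography vcard'})

additional_details = []
for row in fact_table.find_all('tr'):
    heading = row.find('th')
    cell = row.find('td')
    if heading and cell:
        # Collect the value next to each heading
        additional_details.append(cell.get_text(strip=True))
print(additional_details)
```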