In my last article, I discussed generating a dataset using an Application Programming Interface (API) and Python libraries. APIs let us draw very useful information from a website in an easy manner. However, not all websites have APIs, which makes it difficult to gather relevant data. In such cases, we can use web scraping to access a website’s content and create our dataset.
Web Scraping is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format. — WebHarvy
Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can limit ourselves to collecting a large amount of information from a single source and using it as a dataset. In this particular example, we’ll explore Wikipedia. I’ll also explain the HTML basics we’ll need. The complete project is available as a Notebook in the GitHub repository Web Scraping using Python.
This example is just for demonstration purposes. We must always follow a website’s guidelines before scraping it, and especially before accessing its data for any commercial purpose.
This is a two-part article. In this first part, we’ll explore how to get the data from the website using BeautifulSoup, and in the second part, we’ll clean the collected dataset.
Determine the content
“man drawing on dry-erase board” by Kaleidico on Unsplash
We’ll access the List of countries and dependencies by population Wikipedia webpage. The webpage includes a table with the names of countries, their population, the date of data collection, the percentage of world population, and the source. And if we go to any country’s page, all the information about it is presented in a standard box on the right. This box includes a lot of information, such as total area, water percentage, GDP, etc.
Here, we will combine the data from these two webpages into one dataset.
List of Countries: On accessing the first page, we’ll extract the list of countries, their population and percentage of world population.
Country: We’ll then access each country’s page, and get information including total area, percentage water, and GDP (nominal).
Thus, our final dataset will include information about each country.
HTML Basics
Each webpage that you view in your browser is actually structured in HyperText Markup Language (HTML). It has two parts: the head, which includes the title and any imports for styling and JavaScript, and the body, which includes the content that gets displayed as the webpage. We’re interested in the body of the webpage.
HTML is composed of tags. An opening tag consists of the tag’s name enclosed between an opening < and a closing > angle bracket; the matching closing tag adds a forward slash / after the opening angle bracket. For example, <div></div>, <p>Some text</p>, etc.
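The article originally embedded a small Homepage.html file here. A minimal reconstruction, consistent with the description that follows (the exact markup is an assumption, but it uses the id base and the class data mentioned below), might look like:

```html
<!-- Hypothetical Homepage.html: the id is unique, the class is shared -->
<html>
  <head>
    <title>Homepage</title>
  </head>
  <body>
    <div id="base">
      <table>
        <tr>
          <td class="data">Cell 1</td>
          <td class="data">Cell 2</td>
        </tr>
      </table>
    </div>
  </body>
</html>
```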
Homepage.html as an example
There are two direct ways to access any element (tag) present on the webpage. We can use an id, which is unique, or we can use a class, which can be associated with multiple elements. Here, we can see that the <div> has the attribute id set to base, which acts as a reference to this element, while all the table cells marked by td share the same class, data.
Generally useful tags include:
<div>: A container that groups related content into a single entity. It can act as the parent for many different elements, so if style changes are applied to it, they’ll also reflect in its child elements.
<a>: Describes a link; the webpage that gets loaded on clicking the link is specified in its href attribute.
<p>: Used whenever some information is to be displayed as a block of text. Each such tag appears as its own paragraph.
<span>: Used when information is to be displayed inline. When two such tags are placed side by side, they’ll appear on the same line, unlike paragraph tags.
<table>: Displays tables in HTML, where data is shown in cells formed by the intersection of rows and columns.
Import Libraries
We first begin by importing necessary libraries, namely, numpy, pandas, urllib and BeautifulSoup.
numpy: A very popular library that makes array operations simple and fast.
pandas: Lets us organize the data in a tabular structure, so we can manipulate it with its many efficiently developed functions.
urllib: We use this library to open the URL from which we want to extract the data.
BeautifulSoup: Helps us get the HTML structure of the page we want to work with. We can then use its functions to access specific elements and extract relevant information.
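The import cell can be sketched as follows (the aliases np and pd are the usual conventions, not something the article specifies):

```python
# Libraries used throughout this article
import numpy as np                      # fast array operations
import pandas as pd                     # tabular data manipulation
from urllib.request import urlopen      # open URLs
from bs4 import BeautifulSoup           # parse and navigate HTML
```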
Import all libraries
Understand the data
Initially, we define just a basic function that reads a URL and extracts the HTML from it. We’ll introduce new functions as and when they are needed.
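A minimal sketch of this function (the name getHTMLContent() comes from the article; error handling is omitted):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getHTMLContent(link):
    # Open the URL and parse the response with Python's built-in HTML parser
    html = urlopen(link)
    soup = BeautifulSoup(html, 'html.parser')
    return soup
```

Calling it with the Wikipedia page’s URL returns the parsed HTML of the whole page, ready for navigation.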
Function to get HTML of a webpage
The getHTMLContent() function takes the URL as a parameter. We first open the URL using the urlopen method. This lets us apply the BeautifulSoup library to get the HTML using a parser. While many parsers are available, in this example we use html.parser, which parses HTML files. We then simply return the output, which we can use to extract our data.
We use this function to get the HTML content for the Wikipedia page of List of countries. We see that the countries are present in a table. So, we use the find_all() method to find all tables on the page. The parameter that we supply inside this function determines the element that it returns. As we require tables, we pass the argument as table and then iterate over all tables to identify the one we need.
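The table lookup can be sketched like this. To keep the example self-contained and runnable, a tiny inline document (an assumption, not the real Wikipedia markup) stands in for the page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia page, with two tables;
# only the second carries the class we are after.
sample_html = """
<table class="toc"><tr><td>Contents</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Country</th></tr>
  <tr><td><a href="/wiki/India">India</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# find_all() returns every matching element ...
tables = soup.find_all('table')
for table in tables:
    print(table.prettify())

# ... while find() returns the first match, optionally filtered by class
data_table = soup.find('table', {'class': 'wikitable sortable'})
```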
We print each table with the prettify() function, which makes the output more readable. Now we need to analyse the output and see which table has the data we are searching for. After much inspection, we can see that the table with the class wikitable sortable has the data we need. Thus, our next step is to access this table and its data. For this, we will use the find() function, which allows us to specify not only the element we are looking for but also its properties, such as the class name.
Print all country links
A table in HTML is composed of rows denoted by the tags <tr></tr>. Each row has cells, which can either be headings defined using <th></th> or data defined using <td></td>. Thus, to access each country’s webpage, we can get its link from the cells in the Country column of the table (the second column). So, we iterate over all the rows in the table and read the second column’s data into the variable country_link. For each row, we extract the cells and get the a element in the second column (indexing in Python starts at 0, so the second column corresponds to cell[1]). Finally, we print all the links.
The links do not include the base address, so whenever we access any of these links, we’ll append https://en.wikipedia.org as the prefix.
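These two steps can be sketched together. Again, a small inline row (an assumption standing in for the real table) keeps the example runnable:

```python
from bs4 import BeautifulSoup

# Stand-in for one row of the countries table: the link we want
# sits in the second cell (index 1, since Python counts from 0).
row_html = """
<table class="wikitable sortable">
  <tr><td>1</td><td><a href="/wiki/India">India</a></td><td>1,400,000,000</td></tr>
</table>
"""
table = BeautifulSoup(row_html, 'html.parser').find('table')

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 1:               # skip rows without data cells
        country_link = cells[1].find('a')
        # The href is relative, so prepend the Wikipedia base address
        print('https://en.wikipedia.org' + country_link.get('href'))
```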
While the function I developed to extract the data from each country’s webpage might appear small, it went through many iterations before I finalised it. Let’s explore it step by step.
Each country’s page includes an information box on the right, which includes the Motto, Name, GDP, Area and other important features. So, we first identified this box by the same steps as before: it is a table with the class infobox geography vcard. Next, we define the variable additional_details to collect all the information we get from this page in a list, which we can then append to the list-of-countries dataset.
When we enter the Chrome browser’s inspect mode (right-click anywhere and select Inspect) on the country page, we can look at the classes for each heading in the table. We are interested in four fields: Total area, Water (%), GDP (nominal) total, and GDP (nominal) per capita.
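The general shape of the infobox extraction can be sketched as follows. The real function filters for the four specific fields above; this simplified version just collects every value cell, and the inline infobox markup is an assumption standing in for a real country page:

```python
from bs4 import BeautifulSoup

# Stand-in for the infobox on a country page; the class name
# 'infobox geography vcard' is the one found by inspecting Wikipedia.
infobox_html = """
<table class="infobox geography vcard">
  <tr><th>Total</th><td>3,287,263 km2</td></tr>
  <tr><th>Water (%)</th><td>9.6</td></tr>
</table>
"""
soup = BeautifulSoup(infobox_html, 'html.parser')
fact_table = soup.find('table', {'class': 'infobox geography vcard'})

additional_details = []
for row in fact_table.find_all('tr'):
    heading = row.find('th')
    cell = row.find('td')
    if heading and cell:
        # Collect the value next to each heading
        additional_details.append(cell.get_text(strip=True))
print(additional_details)
```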