Web Scraping using Python
My first article about web scraping in Python, for absolute beginners.
What is Web Scraping?
Web Scraping means extracting information from websites by parsing the HTML of the web page.
How do we do it?
Parsing an HTML webpage is really easy in Python. You can get the information you need with a few lines of code.
What do we need?
- Pandas
- Beautiful Soup
- Selenium
1. Pandas
Pandas is mainly used for data analysis. It can import data from various file formats such as CSV, JSON, SQL, and Microsoft Excel, and it supports data-manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.
To install pandas:
pip install pandas
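As a quick illustration of the pattern we will use at the end of this tutorial, pandas can turn plain Python lists into a table. The product names and prices below are made-up sample data, not real scraped values:

```python
import pandas as pd

# Hypothetical sample data standing in for scraped results.
df = pd.DataFrame({
    "Product Name": ["TV A", "TV B"],
    "Price": ["20,999", "35,499"],
})
print(df.shape)  # (2, 2): two rows, two columns
```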
2. Beautiful Soup
Downloading the web page gets you the raw HTML, but you still need to parse that HTML to retrieve the data. That is done by Beautiful Soup.
To install BeautifulSoup:
pip install beautifulsoup4
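Here is a minimal sketch of how Beautiful Soup extracts data: we parse a toy HTML snippet (standing in for a downloaded page, with made-up class names) and pull out the text inside each tag:

```python
from bs4 import BeautifulSoup

# A toy HTML snippet standing in for a real downloaded page.
html = """
<div class="product">
  <div class="name">Example TV 43-inch</div>
  <div class="price">25,999</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns the first tag matching the given name and attributes.
name = soup.find("div", attrs={"class": "name"}).text
price = soup.find("div", attrs={"class": "price"}).text
print(name, price)
```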
3. Selenium
Selenium is a portable framework for testing web applications. It supports many browsers such as Firefox, Chrome, IE, and Safari.
To install Selenium:
pip install selenium
Other supported browsers will have their own drivers available.
For more details about the installation of Selenium, visit the Selenium Installation Guide.
Let’s get started.
This tutorial stays at a basic level; there is much more that can be done using requests and BeautifulSoup.
In this tutorial, we scrape the details of the available televisions from the Flipkart website.
Take a look at the link that we are going to scrape, here.
1. Import the libraries.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
2. Create empty lists to store the details. In this case, we create two empty lists: one for the product name and one for the product price.
productName = []
productPrice = []
3. Selenium will now start a browser session. For Selenium to work, it must access the browser driver.
driver = webdriver.Chrome('Path to the chrome webdriver')
driver.get("paste the url here")
content = driver.page_source
In this case, the Chrome web driver is used. We can also use other web drivers; refer to this website.
4. Inspect the website and look for the class name and tag.
Since we are scraping only the name and price of each product, inspect the tags carefully.
Do the same for the price and copy the tags.
5. Using the find and find_all methods in BeautifulSoup, we extract the data and store it in variables. Refer to the code below:
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_31qSD5'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    if name and price:  # skip entries missing either field
        productName.append(name.text)
        productPrice.append(price.text)
In the code above, the variables called name and price hold the div elements matched by the find method, with the class name passed as a parameter. Using append, we store the details in the lists we created before. Note that Flipkart's class names change over time, so check the current ones by inspecting the page.
6. Store the data in a sheet.
df = pd.DataFrame({'Product Name':productName,'Price':productPrice})
df.to_csv('Products.csv', index=False, encoding='utf-8')
We store the data in comma-separated values (CSV) format.
The whole code looks like this:
Now run the whole code.
All the data is stored as Products.csv in the directory of the Python file.
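To confirm the file was written correctly, you can read it back with pandas. This sketch uses hypothetical sample data in place of the real scraped lists:

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped results.
productName = ["TV A", "TV B"]
productPrice = ["20,999", "35,499"]

df = pd.DataFrame({"Product Name": productName, "Price": productPrice})
df.to_csv("Products.csv", index=False, encoding="utf-8")

# Read the file back to verify the contents round-trip.
check = pd.read_csv("Products.csv")
print(len(check))  # 2 rows
```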
I hope you enjoyed this article on "Web Scraping using Python" and that it was informative and added value to your knowledge. If you like my work, follow me on Medium. Try experimenting with different modules. Have fun learning!
“The beautiful thing about learning is that nobody can take it away from you.”
— B.B. King
Thank you!