Building A Basic Web Scraper

A while back I built a simple web scraper that pulls front-page articles from a website called CoinDesk.com. If you aren't familiar with CoinDesk, it's a popular site for news about cryptocurrency and blockchain technology. I didn't see any API info on their site, so I figured this scraper wouldn't cause any harm (plus I've only run the thing a few times). Let me say up front that I built this solely to gain experience and as a fun way of learning how to work with BeautifulSoup (a Python library for web scraping).

Let’s start with the basics. What is web scraping exactly?

Well, web scraping is the process of automatically extracting information from the web. Web scrapers range from the extremely basic to the highly advanced, and they're used to parse all kinds of data at a much faster rate than someone sitting at a PC could manage going through a website page by page. Like most tools on the web, web scraping has a malicious side as well. Bad actors might use a web scraper to pull information from a competitor's website or even their database, which could give them access to trade secrets and valued customers. NOT good. If you were looking to do anything remotely close to this, I would highly advise against it.

Now then, what will we need to build this web scraper? Well, first you'll need Python 3 installed on your computer. I'm using version 3.7, but I believe any version of Python 3 will be fine. Next, you'll use good ol' pip to install BeautifulSoup4, Requests (for connecting to websites), and an XML & HTML processing library called lxml.
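If you're not sure which version of Python you're running, a quick sanity-check snippet like this (not part of the scraper itself) will tell you:

import sys

# Print the interpreter version and confirm we're on Python 3.
print(sys.version)
assert sys.version_info.major == 3, "This script expects Python 3"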

If you don't have pip yet, the official pip installation guide will walk you through getting it. The install is pretty straightforward, and once you have it, packages are much easier to grab.

$ pip install beautifulsoup4
$ pip install requests
$ pip install lxml

Once those complete, you can start building out your script. I wrote my script at my day job so I only had IDLE available, but if you have an editor you enjoy using, feel free to use that. So to start, import your modules:

# Web scraper that pulls frontpage articles from CoinDesk.com (cryptocurrency site) and saves them to a CSV file.

from bs4 import BeautifulSoup
import requests
import csv

I know I haven't mentioned the csv module yet, because it's sort of optional. I used it because I thought it would be cool to save whatever I scraped to an external file for further use. Don't worry about trying to "pip install csv" either: csv is part of the Python standard library, which means it's already available the moment you install Python. So just add the import statement if you want to create a CSV file.

Next you want to create a variable to hold your request to whatever website you’d like to scrape from:

source = requests.get('http://www.coindesk.com').text
soup = BeautifulSoup(source, 'lxml')

Here I create two variables. One holds the HTML text returned by the requests.get() call to the CoinDesk website, and the other holds the BeautifulSoup object, which takes the source variable and a second argument of 'lxml' to tell BeautifulSoup to parse the page with the lxml library we installed earlier. Next, I want to create the csv variables for creating and writing to the csv file.
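One optional tweak I didn't bother with, but which you might find useful: you can hold onto the full response object and ask Requests to raise an error if the site doesn't respond cleanly, before handing anything to BeautifulSoup. A quick sketch of that variation:

response = requests.get('http://www.coindesk.com')
response.raise_for_status()    # Raises an exception if the site returns an error status (404, 500, etc.)

soup = BeautifulSoup(response.text, 'lxml')    # Same parsing step as before, just using the checked response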

csv_file = open("coindesk_scrape.csv", "w")
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title', 'Date','Summary', 'URL'])

The csv_file variable uses the open() function to create (or overwrite) a file called coindesk_scrape.csv. The "w" stands for write, which allows us to write to the file. I went back and forth about using "w" instead of "a", which appends to a csv file rather than overwriting it, but I felt that if I ever reused this script to scrape another site, it would be better to create a new csv file for each one. The csv_writer.writerow() line adds a header row to the csv file we just created.
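If you do want one file per site while sticking with "w" mode, you could build the filename from the site's name. This is just a sketch (the site_name variable is made up for illustration and isn't in my script); the newline='' argument is something the csv docs recommend so you don't end up with extra blank rows on Windows:

site_name = "coindesk"    # Hypothetical variable so each site gets its own CSV file
csv_file = open(f"{site_name}_scrape.csv", "w", newline="")
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Title', 'Date', 'Summary', 'URL'])    # Same header row as before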

The next section of the script is a for loop that takes the content from the soup variable (remember, soup holds the entire HTML page grabbed by the source variable, parsed through the lxml library) and filters it down based on the HTML markup we've targeted. In this case, CoinDesk has various HTML classes set up for different sections of their website, so I've targeted the ones that match up well with the header we created in the csv file ['Title', 'Date', 'Summary', 'URL']. Finding these items means going to the website of your choice and looking at its source code. You can typically get to the source code by right-clicking on the page, but I use Chrome quite a bit, so I did my "inspecting" with its Developer Tools, which can be found under the "More Tools" menu or through keyboard shortcuts (e.g. Ctrl + Shift + I on a Chromebook).
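Before writing the full loop, you can sanity-check that the class you found in Developer Tools actually matches something on the page. This little snippet isn't part of the final script, just a quick way to peek at the first match:

first_article = soup.find('a', class_="stream-article")    # Grab just the first matching element
if first_article is not None:
    print(first_article.prettify()[:500])    # Print the start of its HTML so you can see the nested tags
else:
    print("No matches - double-check the class name")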

for article in soup.find_all('a', class_="stream-article"):    # Targeted the section for articles
    title = article.find('div', class_="meta").h3.text    # Grabbed the Title of each article
    print(title)

    article_date = article.find('div', class_="time").text    # Pulled the date/time of publishing
    print(article_date)

    summary = article.find('p').text    # Grabbed the article summary
    print(summary)

    web_link = article.get('href')    # Pulled down the URL of the article
    print(web_link)

    print()    # Just an extra print statement to add some separation between each article.

I picked the variable names to match up with the csv_writer header we created earlier. As you can see, some of the HTML I targeted required a bit more fine-tuning to get the right data. Of course, sometimes you get lucky and it's just a matter of grabbing an "href", as with the web_link variable. Another thing you may have noticed is that, for the most part, I'm using two BeautifulSoup methods to collect the data I need. The find() method lets you target specific HTML markup and then drill down to its text using dot notation. I also used the get() method to pull the URL for the web_link variable. I'm really just scratching the surface of what BeautifulSoup can do, but I thought this would suffice for a first attempt at scraping.
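One thing to keep in mind: find() returns None when nothing matches, so if CoinDesk ever changes its markup, a line like article.find('div', class_="meta").h3.text will throw an AttributeError. If you wanted to guard against that, the title lookup inside the same for loop could be rewritten along these lines (just a defensive variation, not what I originally did):

    meta_div = article.find('div', class_="meta")    # This lives inside the same for loop as before
    if meta_div is not None and meta_div.h3 is not None:
        title = meta_div.h3.text
    else:
        title = ""    # Fall back to an empty string instead of crashing on a missing tag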

Finally, still inside the loop, we pass these variables to the csv_writer.writerow() call from before, and once the loop finishes we close the file:

    csv_writer.writerow([title, article_date, summary, web_link])    # Making sure the variables line up with our header.

csv_file.close()    # Closing the file.

We have to make sure we close our file because, as you may have noticed above, we didn't use the open() function with a "with" statement. Basically, a with statement ensures that Python will automatically close the file behind the scenes for you when it's no longer needed. I generally use with when dealing with files, but it isn't required; sometimes manually closing your file is a better fit for the script you're writing. For this one, we do just that.
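For comparison, here's roughly what the same writing section would look like if you did use a with statement (just an alternative sketch; the original script keeps the manual close):

with open("coindesk_scrape.csv", "w", newline="") as csv_file:    # File is closed automatically when this block ends
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Title', 'Date', 'Summary', 'URL'])
    for article in soup.find_all('a', class_="stream-article"):
        title = article.find('div', class_="meta").h3.text
        article_date = article.find('div', class_="time").text
        summary = article.find('p').text
        web_link = article.get('href')
        csv_writer.writerow([title, article_date, summary, web_link])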

Voila! Script completed! Feel free to tweak the code for use with different websites (but make sure you're not sending requests to a site that doesn't want them!), adjust the csv for different information, etc. I had a lot of fun and felt pretty good after building this bad boy. I'd like to build something similar that pulls data using an API, but first I need to pick a good website to use. Maybe a script to pull down weekly weather data? Who knows :)

Anyway, thanks for reading!

~OJ
