How To Parse Website (1)

In this post I will guide you step by step through parsing a website. I chose Python. Every language has its own style, but I prefer Python because I know its strengths and it is resilient. Maybe some other time I will try another language as an example. This post will not show you how to install Python.

Sometimes you need data from a website, perhaps from a table or from some other content on the page. So what do you do? Done manually, you open the website and copy the data somewhere else. But that is not how computer scientists think, hehehe. A small script is enough to automate the process.

Here is one example. Let's say I want to parse something from this website: http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30

You can try this link in your browser too.

Before parsing, we need to fetch the content of the website. You can use Python's urllib library like this. By the way, I'm using Python 3.4, so you need at least Python 3 to make this script work. Open your Python command line.

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30')
>>> content = f.read()
>>> f.close()
>>> content

b'\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\
r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml" >\r\n<head id="Head1"><title>\r\n\t::CPCB Current Air Pollution Levels::\r\n</titl
e><link href="style.css" rel="stylesheet" type="text/css" /><meta http-equiv="refresh" content="60" />\r\n <script language="Java
Script">\r\n<!--\r\n\r\nvar sURL = unescape(window.location.pathname);\r\n\r\nfunction doLoad()\r\n{ \r\n var qrStr = window.l
ocation.search;\r\n var spQrStr = qrStr.substring(1);\r\n setTimeout( "location.href=\'frmCurrentDataNew.aspx?station="+spQrSt
r+"\'", 5*60000 );\r\n}\r\n\r\nfunction refresh()\r\n{\r\n window.location.href = sURL;\r\n}\r\n//-->\r\n</script>\r\n</head>\r\n…

I've omitted the rest of the long output. Try it in your own Python command line. Here is some explanation.

import urllib.request

In Python, import loads a library so that we can use it. A library in Python is called a module. The name of the module comes after import; in this post it is urllib.request.

f = urllib.request.urlopen('http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30')

This command opens the website using the urlopen() function from the urllib.request module. Mind the hierarchy of the module; I will explain it in another article. Inside the parentheses of urlopen() is the website link. It can be any link you can access with your browser. Once the website is opened, the result is assigned to the variable f. This f is a response object through which we can read everything the link serves. By content I mean the page source, in HTML format. Up to this step we have not parsed anything yet. We have only requested the page and stored the resulting object in f.
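The open, read, and close steps can also be wrapped in a small helper. This is only a sketch under my own assumptions: the function name fetch_bytes and the 10 second timeout are made up for illustration and are not part of urllib.

```python
import urllib.request


def fetch_bytes(url, timeout=10):
    """Return the raw response body of `url` as bytes.

    `fetch_bytes` and the default timeout are illustrative choices.
    """
    # The with-block closes the response even if read() raises,
    # so no explicit f.close() is needed.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_bytes() with the link used in this post should give the same bytes as the f.read() call in the session above, as long as the site is reachable.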

content = f.read()
content

We want to see what the website returned, so we call the read() function on the object f and store the result in content. Just type content in the Python command line and the whole result shows up as HTML.
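Notice the b at the start of the output: read() returns bytes, not a normal string. If you want a string, you have to decode it. Here is a minimal sketch; the helper name decode_body is my own, and UTF-8 is only an assumption, since a page may declare a different encoding.

```python
def decode_body(raw, charset="utf-8"):
    """Turn the bytes returned by read() into a str.

    `decode_body` is a hypothetical helper; UTF-8 is an assumed default.
    """
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset, errors="replace")


# A short stand-in for the real page content.
text = decode_body(b"<title>\r\n\t::CPCB Current Air Pollution Levels::\r\n</title>")
print(type(text).__name__)
```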

Parse HTML
Now here comes the magic line. Using the BeautifulSoup module we can parse the HTML text. You have to install the module first, and this post will not show how to install it.

>>> from bs4 import BeautifulSoup as bs
>>> parsing_content = bs(content, "html.parser")
>>> parsing_content

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1"><title>
::CPCB Current Air Pollution Levels::
</title><link href="style.css" rel="stylesheet" type="text/css"/><meta content="60" http-equiv="refresh"/>
<script language="JavaScript">
<!--

var sURL = unescape(window.location.pathname);

function doLoad()
{
var qrStr = window.location.search;
var spQrStr = qrStr.substring(1);
setTimeout( "location.href='frmCurrentDataNew.aspx?station="+spQrStr+"'", 5*60000 );
}

Once again I have omitted the rest of the output. Now you can see the difference between the raw output from urllib and the output from BeautifulSoup. As its name suggests, it beautifies the HTML text. Now we can navigate the data structure.

>>> parsing_content.title
<title>
::CPCB Current Air Pollution Levels::
</title>
>>> parsing_content.title.name
'title'
>>> parsing_content.title.string
'\r\n\t::CPCB Current Air Pollution Levels::\r\n'
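The "html.parser" string we passed to BeautifulSoup selects the HTML parser that ships with Python. Just for comparison, here is a rough sketch that uses that standard module directly to pull out the title and trim the \r\n\t whitespace seen above. This is not how BeautifulSoup works inside; the class name TitleGrabber and the short sample string are my own illustration.

```python
from html.parser import HTMLParser


class TitleGrabber(HTMLParser):
    """Collect the text inside the first <title> element (illustrative class)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Start collecting only for the first <title> we meet.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # strip() removes the leading and trailing whitespace around the title.
        return "".join(self._parts).strip()


# A shortened stand-in for the page source shown earlier.
sample = (
    "<html><head><title>\r\n\t::CPCB Current Air Pollution Levels::\r\n"
    "</title></head><body></body></html>"
)
grabber = TitleGrabber()
grabber.feed(sample)
grabber.close()
print(grabber.title)
```

With BeautifulSoup you get the same cleanup by calling the ordinary string method strip() on parsing_content.title.string.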

This example just shows you the basics of parsing a website. In the next post I will show you more.
