
How To Parse Website (2)

My earlier post, How to Parse Website Part 1, showed the basics of parsing a website. In this post I will go a step further. Open your Python command line again and parse the same website as last time. Here is how that page looks when opened in a browser.

Central Pollution Control Board

This is the Central Pollution Control Board page, which monitors air quality in a given area. Now, what if I want to grab the parameters from the table, such as Nitric Oxide, Nitrogen Dioxide, Ozone, and so on? Look again at our last script from How to Parse Website Part 1. This was the last command we entered in the Python command line.

>>> parsing_content = bs(content, "html.parser")

BeautifulSoup has parsed all the content for us. Now it is our turn to dig deeper. In your browser, view the page source (Ctrl+U) and find where the parameters reside.

Now we know where "Nitric Oxide", "Nitrogen Dioxide", and the rest are located. To narrow things down to that table, use this command to access the HTML tag that wraps it.

>>> table = parsing_content.find('td', attrs={'id': 'Td1'})

>>> table
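
If you want to try find() without hitting the live site, here is a minimal sketch on a made-up piece of HTML. Only the id Td1 comes from the real page; the markup, the readings, and the other ids are invented for illustration.

```python
from bs4 import BeautifulSoup as bs

# Made-up markup that imitates the layout of the real page:
# an outer <td id="Td1"> wrapping an inner table of readings.
sample_html = """
<table>
  <tr><td id="Td0">Station info</td></tr>
  <tr><td id="Td1">
    <table>
      <tr><td>Nitric Oxide</td><td>12.5</td></tr>
      <tr><td>Ozone</td><td>30.1</td></tr>
    </table>
  </td></tr>
</table>
"""

soup = bs(sample_html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
table = soup.find("td", attrs={"id": "Td1"})
missing = soup.find("td", attrs={"id": "NoSuchId"})

print(table["id"])      # Td1
print(missing is None)  # True
```

Checking for None before going further is worth the extra line, because a page that changes its ids would otherwise fail later with a less obvious error.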

We can narrow it down further with a filter. From the source we know each value sits between <td> and </td>. Use this command to show only the <td> tags.

>>> table.findAll('td')

The result of findAll is a list of every <td> ... </td> element it found. You can access a single item by its index with this command. Remember that Python lists start at index 0.

>>> table.findAll('td')[0]

>>> table.findAll('td')[7]
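
To see how indexing into the result behaves, here is a small offline sketch. The HTML is made up and has only two cells per row, so the index numbers differ from the real page.

```python
from bs4 import BeautifulSoup as bs

# Made-up rows; the real table has more cells per row.
sample_html = """
<table>
  <tr><td>Parameter</td><td>Value</td></tr>
  <tr><td>Nitric Oxide</td><td>12.5</td></tr>
  <tr><td>Ozone</td><td>30.1</td></tr>
</table>
"""

soup = bs(sample_html, "html.parser")
cells = soup.find_all("td")  # find_all is the current spelling of findAll

print(len(cells))  # 6
print(cells[0])    # <td>Parameter</td>
print(cells[2])    # <td>Nitric Oxide</td>

# An index past the end raises IndexError, just like a plain list.
try:
    cells[99]
except IndexError:
    print("no such cell")
```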

Voila! You now have access to "Nitric Oxide". What about the others? Use these commands.

>>> cells = table.findAll('td')
>>> # Cell 7 was printed above; the next parameter names follow every 7 cells.
>>> for i in range(14, len(cells), 7):
...     print(cells[i])
...
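
Another way to take every Nth cell is a slice with a step. This is only a sketch: the plain list of strings below is a hypothetical stand-in for the result of findAll, with three cells per row instead of the seven-cell spacing on the real page, and the readings are invented.

```python
# Hypothetical stand-in for table.findAll('td'): three cells per row
# here instead of the seven-cell spacing on the real page.
cells = [
    "Parameter", "Value", "Unit",
    "Nitric Oxide", "12.5", "ug/m3",
    "Ozone", "30.1", "ug/m3",
]

cells_per_row = 3

# cells[start::step] takes every step-th item beginning at start and
# stops at the end of the list, so no IndexError handling is needed.
names = cells[cells_per_row::cells_per_row]
print(names)  # ['Nitric Oxide', 'Ozone']
```

The slice skips the header row by starting at the first cell of the second row, which is the same idea as starting the count at 7 on the real page.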

You now have every value in the parameters column, but each one is still wrapped in its HTML tag. We only need the text. How do we remove the tags? The get_text() function strips them all. Here is the complete script.

from bs4 import BeautifulSoup as bs
import urllib.request

url = "http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30"

# Download the page
f = urllib.request.urlopen(url)
content = f.read()
f.close()

# Parse the HTML and locate the cell that wraps the data table
parsing_content = bs(content, "html.parser")
table = parsing_content.find('td', attrs={'id': 'Td1'})

# Cell 7 is "Nitric Oxide"; each following parameter name is 7 cells later
cells = table.findAll('td')
for i in range(7, len(cells), 7):
    print(cells[i].get_text())
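
To see what get_text() does on its own, here is a tiny offline sketch. The cell is made up, with nested markup and stray spaces added on purpose.

```python
from bs4 import BeautifulSoup as bs

# Made-up cell with a nested <b> tag and spaces around it.
cell = bs("<td> <b>Nitric Oxide</b> </td>", "html.parser").td

print(cell)                       # the tag, markup included
print(cell.get_text())            # text only, surrounding spaces kept
print(cell.get_text(strip=True))  # Nitric Oxide
```

Passing strip=True is handy when the page pads its cells with spaces or line breaks, which is common in generated tables.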
