My previous post, How to Parse Website Part 1, showed you the basics of website parsing. In this post I will take it further. Open your Python command line again and parse the same website as last time. Here is the page I have been parsing, as it appears in a browser.
This is the Central Pollution Control Board site, which monitors air quality in certain areas. Now, what if I want to grab the parameters from the table, like Nitric Oxide, Nitrogen Dioxide, Ozone, and so on? Check our last script again in How to Parse Website Part 1. This was the last command in the Python command line.
>>> parsing_content = bs(content, "html.parser")
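(If you are starting from a fresh session, content comes from fetching the page first. This is the same setup used in the complete script at the end of this post.)
>>> from bs4 import BeautifulSoup as bs
>>> import urllib.request
>>> url = "http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30"
>>> f = urllib.request.urlopen(url)
>>> content = f.read()
>>> f.close()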
Let BeautifulSoup do the rest of the parsing. Now it is our turn to dig deeper. In your browser, view the page source (Ctrl+U) and find the section where the parameters reside.
Now we know where "Nitric Oxide", "Nitrogen Dioxide", and the other parameters live. Since we want to narrow things down to that table, here is the command for accessing its HTML tag.
>>> table = parsing_content.find('td', attrs={'id' : 'Td1'})
>>> table
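(A small aside that is not part of the original script: if the page layout ever changes and no cell with id="Td1" exists, find() returns None, so a quick check can save some confusion later.)
>>> if table is None:
...     print("Could not find the cell with id='Td1'")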
We can narrow it down even further with a filter. From the source we know the values sit between <td> and </td>. Use this command to show only the <td> tags inside that cell.
>>> table.findAll('td')
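(Since findAll() returns a list, you can also check how many cells were matched; the exact number depends on the live page.)
>>> len(table.findAll('td'))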
The findAll() function returns a list of every element wrapped in <td> and </td> tags. You can access the list by index, and as you know, Python lists start at 0.
>>> table.findAll('td')[0]
>>> table.findAll('td')[7]
Voila! You have reached "Nitric Oxide". How about the others? Use this loop, which steps through the list seven cells at a time so it lands on the parameter name of each row.
>>> counter = 7
>>> for link in range(len(table.findAll('td'))):
...     try:
...         counter = counter + 7
...         print(table.findAll('td')[counter])
...     except IndexError:
...         break
You now have every value in the parameters column, but the output is cluttered: we only want the text, without the HTML tags. How do we remove them? The get_text() function strips the markup and returns just the text.
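For example, a quick interactive check (the exact output depends on the live page):
>>> table.findAll('td')[7].get_text()
Here is the complete script.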
from bs4 import BeautifulSoup as bs
import urllib.request

url = "http://www.cpcb.gov.in/CAAQM/frmCurrentDataNew.aspx?CityId=7&StationName=Hyderabad&StateId=30"

# Fetch the page and hand it to BeautifulSoup
f = urllib.request.urlopen(url)
content = f.read()
f.close()
parsing_content = bs(content, "html.parser")

# Narrow the parsing down to the table cell with id="Td1"
table = parsing_content.find('td', attrs={'id' : 'Td1'})

# The first parameter name sits at index 7; every seventh cell
# after that holds the parameter name of the next row
print(table.findAll('td')[7].get_text())
counter = 7
for link in range(len(table.findAll('td'))):
    try:
        counter = counter + 7
        print(table.findAll('td')[counter].get_text())
    except IndexError:
        break
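As a closing note, the same stride-of-seven walk can be written more compactly with a list slice. This is just a variation on the loop above (params is my own variable name, not something from the original script):

# collect every seventh cell starting at index 7, i.e. the parameters column
params = [cell.get_text() for cell in table.findAll('td')[7::7]]
for p in params:
    print(p)

The slice [7::7] picks indices 7, 14, 21, and so on, which are exactly the cells the counter in the script visits, and it stops by itself at the end of the list, so there is no IndexError to catch.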