I was asked a favour: to compile all 2922 pages of the Clinton donor list into a single text file for easy access when doing research. I took this as an opportunity to compare the performance of BeautifulSoup and lxml for scraping.
First things first, I used Free Download Manager’s batch download function to save the pages as a series of HTML files called donors-list1.html … donors-list2922.html. I then printed the output of BeautifulSoup’s prettify function and discovered that the donor list sits in a table with the style “margin-left: 95px;” defined, with all the subsequent entries in “td” cells. So I hacked up the following script, testing it first with 100 entries.
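For reference, the inspection step was nothing more than parsing one downloaded page and printing the prettified tree to spot that table. A minimal sketch, assuming the first downloaded page as an example:

from BeautifulSoup import BeautifulSoup

# Parse a single downloaded page and dump its structure to the console;
# this is how the "margin-left: 95px;" table was identified.
html = open("C:\\Downloads\\clinton\\donors-list1.html").read()
soup = BeautifulSoup(html)
print soup.prettify()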
from BeautifulSoup import BeautifulSoup

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # open html file
    file = open("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo)
    html = file.read()
    # parse with beautiful soup
    soup = BeautifulSoup(html)
    # look for all children of the tag with attribute of margin-left: 95px
    for i in soup.findAll(style="margin-left: 95px;")[0]:
        # then find a td entry
        x = i.find("td")
        try:
            # if not a padding cell
            if x != -1:
                # if entry is something like "$200 to $300" add extra lines
                if x.string[0] == "$":
                    output.write("\n")
                    output.write(x.string)
                    output.write("\n\n")
                # else just print it out
                else:
                    output.write(x.string)
                    output.write("\n")
        # This error comes up occasionally, needs investigating
        except TypeError:
            pass
    file.close()
Running this script took an incredible amount of time, so I ran it through cProfile and got the result “13460574 function calls (12914824 primitive calls) in 91.940 CPU seconds”. Wow, that’s almost 1 second per file.
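If you want to reproduce the measurement, the profile can be gathered either in code or from the command line. A rough sketch, where the function and script names are just placeholders for whatever wraps the loop above:

import cProfile

# Profile the scraping loop; scrape_with_beautifulsoup is a placeholder
# for a function wrapping the script above.
cProfile.run('scrape_with_beautifulsoup()')

# Or, without touching the script at all:
#   python -m cProfile scrape.py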
So I tried to see if I could optimise it further, perhaps by using “find” instead of “findAll” for the table and using tree navigation instead of “find” for accessing the “td” entries.
Replacing “findAll” with “find” brought it down to 71 CPU seconds, but replacing i.find(“td”) with an iterator increased it by another 2 seconds! Profiling shows a lot of time spent in recursiveChildGenerator. The code at this stage looks like this:
from BeautifulSoup import BeautifulSoup

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # open html file
    file = open("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo)
    html = file.read()
    # parse with beautiful soup
    soup = BeautifulSoup(html)
    # look for the tag with attribute of margin-left: 95px
    start = soup.find(style="margin-left: 95px;")
    for i in start:
        outputString = []
        try:
            # if entry is something like "$200 to $300" add an extra line
            if i.td.string[0] == "$":
                outputString.append("\n")
            outputString.append(i.td.string)
            outputString.append("\n")
            output.write(''.join(outputString))
        except AttributeError:
            pass
    file.close()
output.close()
OK, let’s try using findAll(“td”); maybe it will give better results:
for i in start.findAll("td"):
    outputString = []
Result: 72 seconds. Somehow I don’t think it’s going to get better than this.
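For timing these variations side by side, wrapping each version in a function and measuring it with the standard library keeps the comparison honest. A rough sketch; the wrapper function names are hypothetical:

import time

def time_it(label, func):
    # Crude wall-clock timing; cProfile gives a more detailed breakdown
    # but adds its own overhead.
    start = time.clock()
    func()
    print "%s: %.3f seconds" % (label, time.clock() - start)

# Hypothetical wrappers around the scraping loops above and below:
# time_it("BeautifulSoup", scrape_with_beautifulsoup)
# time_it("lxml", scrape_with_lxml)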
Let’s try again with lxml, using the following code:
from lxml import etree

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # parse html file
    tree = etree.parse("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo, etree.HTMLParser())
    for elt in tree.getiterator('td'):
        outputString = []
        try:
            result = elt.text.encode("utf-8")
            # if entry is something like "$200 to $300" add an extra line
            if result[0] == "$":
                outputString.append("\n")
            outputString.append(result)
            outputString.append("\n")
            output.write(''.join(outputString))
        except AttributeError:
            pass
output.close()
This is incredible: the result is 0.427 seconds! At least for this kind of bulk HTML parsing, lxml is far more efficient than BeautifulSoup. I know which tool I’ll be using for parsing my XML files next time.
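One caveat: the lxml version above walks every “td” in the page rather than only those under the “margin-left: 95px;” element, so it isn’t a strictly like-for-like comparison with the BeautifulSoup scripts. Restricting it with an XPath expression would make it equivalent; a minimal sketch, not the code I actually timed:

from lxml import etree

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    tree = etree.parse("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo, etree.HTMLParser())
    # only td cells inside the element styled with margin-left: 95px
    for elt in tree.xpath('//*[@style="margin-left: 95px;"]//td'):
        text = elt.text
        if text is None:
            continue
        result = text.encode("utf-8")
        # if entry is something like "$200 to $300" add an extra line
        if result.startswith("$"):
            output.write("\n")
        output.write(result + "\n")
output.close()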