I was asked a favour: to compile all 2922 pages of the Clinton donor list into a single text file for easy access when doing research. I took this as an opportunity to compare the performance of BeautifulSoup and lxml for scraping.
First things first, I used Free Download Manager’s batch download function to save the pages as a series of HTML files called donors-list1.html … donors-list2922.html. I then printed the output of BeautifulSoup’s prettify function and discovered that the donor list sits in a table with the style “margin-left: 95px;” defined, with all the subsequent entries in “td” cells. So I hacked up the following script, testing it first with 100 entries.
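For reference, the inspection step was nothing more than parsing one downloaded page and printing the prettified tree to spot that table. A minimal sketch, assuming the first downloaded page as an example:

from BeautifulSoup import BeautifulSoup

# Parse a single downloaded page and dump its structure to the console;
# this is how the "margin-left: 95px;" table was identified.
html = open("C:\\Downloads\\clinton\\donors-list1.html").read()
soup = BeautifulSoup(html)
print soup.prettify()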
from BeautifulSoup import BeautifulSoup

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # open html file
    file = open("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo)
    html = file.read()
    # parse with beautiful soup
    soup = BeautifulSoup(html)
    # look for all children of the tag with attribute of margin-left: 95px
    for i in soup.findAll(style="margin-left: 95px;")[0]:
        # then find a td entry
        x = i.find("td")
        try:
            # if not a padding cell
            if x != -1:
                # if entry is something like "$200 to $300" add extra lines
                if x.string[0] == "$":
                    output.write("\n")
                    output.write(x.string)
                    output.write("\n\n")
                # else just print it out
                else:
                    output.write(x.string)
                    output.write("\n")
        # This error comes up occasionally, needs investigating
        except TypeError:
            pass
    file.close()
Running this script took an incredible amount of time, so I ran it through cProfile and got the result “13460574 function calls (12914824 primitive calls) in 91.940 CPU seconds”. Wow, that’s almost 1 second per file.
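If you want to reproduce the measurement, the profile can be gathered either in code or from the command line. A rough sketch, where the function and script names are just placeholders for whatever wraps the loop above:

import cProfile

# Profile the scraping loop; scrape_with_beautifulsoup is a placeholder
# for a function wrapping the script above.
cProfile.run('scrape_with_beautifulsoup()')

# Or, without touching the script at all:
#   python -m cProfile scrape.py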
So I tried to see if I could optimise it further, perhaps by using “find” instead of “findAll” for the table and using tree navigation instead of “find” for accessing the “td” entries.
Replacing “findAll” with “find” brought it down to 71 CPU seconds, but replacing i.find(“td”) with an iterator increased it by another 2 seconds! Profiling shows a lot of time spent in recursiveChildGenerator. The code at this stage looks like this:
from BeautifulSoup import BeautifulSoup

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # open html file
    file = open("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo)
    html = file.read()
    # parse with beautiful soup
    soup = BeautifulSoup(html)
    # look for the tag with attribute of margin-left: 95px
    start = soup.find(style="margin-left: 95px;")
    for i in start:
        outputString = []
        try:
            # if entry is something like "$200 to $300" add an extra line
            if i.td.string[0] == "$":
                outputString.append("\n")
            outputString.append(i.td.string)
            outputString.append("\n")
            output.write(''.join(outputString))
        except AttributeError:
            pass
    file.close()
output.close()
OK, let’s try using findAll(“td”); maybe it will give better results:
for i in start.findAll("td"):
    outputString = []
Result: 72 seconds. Somehow I don’t think it’s going to get better than this.
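For timing these variations side by side, wrapping each version in a function and measuring it with the standard library keeps the comparison honest. A rough sketch; the wrapper function names are hypothetical:

import time

def time_it(label, func):
    # Crude wall-clock timing; cProfile gives a more detailed breakdown
    # but adds its own overhead.
    start = time.clock()
    func()
    print "%s: %.3f seconds" % (label, time.clock() - start)

# Hypothetical wrappers around the scraping loops above and below:
# time_it("BeautifulSoup", scrape_with_beautifulsoup)
# time_it("lxml", scrape_with_lxml)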
Let’s try again with lxml, using the following code:
from lxml import etree

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    # parse html file
    tree = etree.parse("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo, etree.HTMLParser())
    for elt in tree.getiterator('td'):
        outputString = []
        try:
            result = elt.text.encode("utf-8")
            # if entry is something like "$200 to $300" add an extra line
            if result[0] == "$":
                outputString.append("\n")
            outputString.append(result)
            outputString.append("\n")
            output.write(''.join(outputString))
        except AttributeError:
            pass
output.close()
This is incredible: the result is 0.427 seconds! At least for this kind of bulk HTML parsing, lxml is far more efficient than BeautifulSoup. I know which tool I’ll be using for parsing my XML files next time.
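One caveat: the lxml version above walks every “td” in the page rather than only those under the “margin-left: 95px;” element, so it isn’t a strictly like-for-like comparison with the BeautifulSoup scripts. Restricting it with an XPath expression would make it equivalent; a minimal sketch, not the code I actually timed:

from lxml import etree

output = open("c:\\temp\\clintonlist.txt", 'w')
for fileNo in xrange(1, 100):
    tree = etree.parse("C:\\Downloads\\clinton\\donors-list%d.html" % fileNo, etree.HTMLParser())
    # only td cells inside the element styled with margin-left: 95px
    for elt in tree.xpath('//*[@style="margin-left: 95px;"]//td'):
        text = elt.text
        if text is None:
            continue
        result = text.encode("utf-8")
        # if entry is something like "$200 to $300" add an extra line
        if result.startswith("$"):
            output.write("\n")
        output.write(result + "\n")
output.close()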