Travelling down a stack of dependency woes – How to parse HTML in Windows with Python

I wanted to parse HTML with Python on Windows. As it turned out, every step I tried led to another problem. In case you are about to lose an entire day to the same chain of problems, I wrote the steps down here.

  1. Problem 1: Beautiful Soup isn’t supported anymore
    Beautiful Soup is the de facto HTML parser. Beloved by Python programmers, it’s capable of dealing with broken and messy HTML. Sadly, the libraries that it used are being replaced, and the main developer doesn’t have time to work on it anymore.
    Solution: This was the easiest problem to deal with. I asked the New York Python Meetup, and they all recommended lxml.
  2. Problem 2: lxml doesn’t have a Python 2.7 build
    The easy solution – “easy_install lxml” – is supposed to fetch an egg file precompiled with lxml’s dependencies (at least, that is what the INSTALL file in the download says).
    There were two problems:
    1. It doesn’t.
    2. None of the .exes on the site are for Python 2.7.
    Solution: As it turns out, there’s a way around this dilemma: someone has posted a script online that builds lxml from source. It seemed to be for 32-bit only, but I gave it a quick spin anyway.
  3. Problem 3: Cython error
    Solution: This fix was easy.
  4. Problem 4: “vcvarsall.bat” missing
    As it turns out, building many Python packages requires vcvarsall.bat, a batch file in Microsoft’s Visual C++ toolchain that sets up the environment for the compiler. The fix that comes up in search engine results involves hacking in a different compiler (gcc from MinGW), which I suspected might cause other incompatibilities.
    Solution: After talking with a friend from Microsoft, I decided that installing the Windows SDK would be a good place to start. That didn’t work, but I then installed Visual C++ Express, which does include vcvarsall.bat.
  5. Problem 5: “vcvarsall.bat” still missing
    For some unknown reason, even after I added vcvarsall.bat to the path, the error still came up.
    Solution: It was at this point that I remembered that the build script was for 32-bit. If I was willing to go to the trouble of trying a 32-bit build, it was probably worth trying a precompiled 32-bit exe as well, and I found one on the same site I had visited earlier.
  6. Problem 6: Installation didn’t work
    Even though the install went fine and “import lxml” worked without a hitch, the lxml package was strangely empty – there was nothing in it!
    Solution: I went through site-packages and cleaned it out – there were two separate lxmls in there from my previous experiments. Removing one of them cleared it all up.
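
The cleanup in Problem 6 was done by hand, but a short script can at least point at the leftovers. What follows is only a sketch under a few assumptions: the helper name find_installs is made up for illustration, it simply matches folder and file names on the import path, and the example names in the comments are placeholders, not the files I actually found.

```python
import os
import sys


def find_installs(package, search_paths=None):
    """List entries on the import path whose names look like copies of `package`.

    Hypothetical helper: it only compares names, it does not import anything.
    """
    if search_paths is None:
        search_paths = sys.path
    prefix = package.lower()
    hits = []
    seen = set()
    for base in search_paths:
        if not base or not os.path.isdir(base):
            continue
        for entry in sorted(os.listdir(base)):
            name = entry.lower()
            # Accept "lxml", "lxml-<version>.egg", "lxml.<something>",
            # but not an unrelated package such as "lxmlfoo".
            looks_like_package = (
                name == prefix
                or name.startswith(prefix + "-")
                or name.startswith(prefix + ".")
            )
            full_path = os.path.join(base, entry)
            if looks_like_package and full_path not in seen:
                seen.add(full_path)
                hits.append(full_path)
    return hits


if __name__ == "__main__":
    for path in find_installs("lxml"):
        print(path)
```

More than one package folder or egg in the output is a hint that an earlier experiment left a second copy behind. Which one to delete is still a judgment call.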

(I recently had an experience where I couldn’t beta test some software because it was built for 32-bit, but my computer was 64-bit. As it turned out, that didn’t apply here.)
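
A binary installer has to match the bitness of the Python interpreter, not of Windows itself, so it is worth checking the interpreter before picking an .exe. This is a minimal sketch using only the standard library; the function name interpreter_bits is my own, and the script reports what it finds without recommending an installer.

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on how the running Python was built."""
    # "P" is the struct format code for a C pointer; its size in bytes
    # times 8 gives the pointer width of this interpreter.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python %s is a %d-bit build" % (sys.version.split()[0], interpreter_bits()))
    print("Machine type reported by the OS: %s" % platform.machine())
```

A 32-bit Python on a 64-bit machine reports 32 here, and in that case the 32-bit installer is the one to try.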

Possible lesson: If something is unlikely to work, but is easy and quick, try it anyway.
