When I first wrote Scrapemark, I wanted to take a completely different approach to parsing HTML documents. To me, the most painful part of using the existing tools of the day was extracting the data you wanted. Scrapemark’s innovation was that you could write the data extraction in an easy-to-understand “template language”. Actually, you might consider it a “reverse template language” because instead of inserting values into a document, it extracted them from one.
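To make the "reverse template" idea concrete, here is a toy sketch in Python. It is not Scrapemark's syntax or implementation: the `{{ name }}` placeholder style is borrowed from Mustache-like languages, and `compile_reverse_template` and `extract` are hypothetical helper names. It simply turns a template into a regular expression with named groups, which only works on markup that matches the template character for character.

```python
import re

# Hypothetical placeholder syntax for this sketch: {{ name }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def compile_reverse_template(template):
    """Turn a template into a regex whose named groups capture the values."""
    # Splitting on a pattern with one capture group alternates between
    # literal text (even indices) and placeholder names (odd indices).
    parts = PLACEHOLDER.split(template)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)            # literal markup must match exactly
        else:
            regex += "(?P<%s>.*?)" % part       # placeholder captures lazily
    # Note: each placeholder name may appear only once in a template.
    return re.compile(regex, re.DOTALL)


def extract(template, document):
    """Return a dict of extracted values, or None if the template does not match."""
    match = compile_reverse_template(template).search(document)
    return match.groupdict() if match else None


html = '<h1>Example Page</h1><p class="by">Written by Jane</p>'
template = '<h1>{{ title }}</h1><p class="by">Written by {{ author }}</p>'
result = extract(template, html)
print(result)
```

A forward template would take the title and author and produce the HTML; here the same template shape pulls them back out. A real library would need to tolerate whitespace, attribute order, and malformed markup, which a plain regex does not.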
I still consider this a very promising idea. It’s even more relevant two years later because developers are more accustomed to template languages through projects like Mustache and Handlebars.
The reverse-template idea is also cool because it could theoretically parse a large HTML/XML document without ever holding the entire document in memory. You could do it stream-style. It’s like SAX but with an API you’d actually want to use.
However, I’ve sadly come to the conclusion that I don’t have enough time to maintain this project. I’m focusing more on FullCalendar these days. I actually stopped working on this project a while ago, but I’m just now getting around to blogging about it.
Viable alternatives for parsing HTML with an easy-to-use extraction layer include soupselect and pyquery (see Stack Overflow thread), but I still think there’s room in the world for a new-wave approach.
I encourage anyone who is interested to start work on their own library. Though if I were you, I wouldn’t build on top of the Scrapemark codebase because of certain fundamental flaws. A new library should probably be based on a real HTML parser (like Python’s HTMLParser), or better yet, another SAX-style parser that robustly handles malformed HTML. Also, I’d probably give the reverse-template syntax an overhaul and introduce some more control structures.
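For anyone wondering what the event-driven foundation might look like, here is a minimal sketch using Python 3's standard `html.parser` module (the successor to the Python 2 `HTMLParser` module). It is not a reverse-template engine, just the kind of streaming layer one could be built on. The `LinkCollector` class and `collect_links` function are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from <a> tags as parser events arrive."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate the fragments.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(chunks):
    """Feed the document piece by piece; the whole text is never joined up."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# Chunks deliberately split in the middle of a tag and in the middle of text.
chunks = [
    '<ul><li><a hre',
    'f="/one">First</a></li>',
    '<li><a href="/two">Sec',
    'ond</a></li></ul>',
]
links = collect_links(chunks)
print(links)
```

The parser buffers an incomplete tag until the next chunk arrives, so only the state for the link currently being read has to be kept around. A reverse-template layer would replace the hand-written handlers with ones generated from the template.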
If you have any questions about the future of Scrapemark, please contact me, or better yet, leave a comment for all to see.