Scrapemark

Adam Shaw

Most of the time, you’ll be doing something like this:

from scrapemark import scrape

scrape("""
       your pattern here
       """,
       url='http://someurl.com/')

However, for the sake of these examples, we will be passing the html argument to scrapemark.scrape(). The html argument will have the following string value:

<html>
<head>
<title>The Site Title :: The Page Title</title>
<META HTTP-EQUIV=REFRESH CONTENT="1; URL=http://otherurl.com/">
</head>
<body>

<ul id='nav' class='section'>
<li><span><a href='home.html'>Home</a></span></li>
<li><span><a href='about.html'>About</a></span></li>
<li><span><a href='photos.html'>Photos</a></span></li>
</ul>

<div id='content' class='section'>
Look at these data points
<table>
<tr><th>Day</th><th>Test 1</th><th>Test 2</th></tr>
<tr><td>1</td><td>5.6</td><td>24.5</td></tr>
<tr><td>2</td><td>1.1</td><td>12.8</td></tr>
<tr><td>3</td><td>2.4</td><td>5.67</td></tr>
</table>
</div>

<div id='footer' class='section'>
<a href='disclaimer.html'>Disclaimer</a> | <a href='contact.html'>Contact</a>
</div>

</body>
</html>

scrape some text:

>>> scrape("""
        <title>:: {{ page_title }}</title>
        """,
        html)

>>> {'page_title': 'The Page Title'}

scrape some text (quick version):

>>> scrape("""
        <title>:: {{ }}</title>
        """,
        html)

>>> 'The Page Title'

loop over certain divs, scrape a list:

>>> scrape("""
        <body>
        {*
                <div class='section' id='{{ [section_ids] }}' />
        *}
        </body>
        """,
        html)

>>> {'section_ids': ['nav', 'content', 'footer']}

scrape text before a certain element:

>>> scrape("""
        <div id='content'>
        {{ before_table }}
        <table />
        </div>
        """,
        html)

>>> {'before_table': 'Look at these data points'}

scrape a column from a table (as a list of ints):

>>> scrape("""
        <table>
        <tr />
        {*
                <tr>
                <td>{{ [day_numbers]|int }}</td>
                </tr>
        *}
        </table>
        """,
        html)

>>> {'day_numbers': [1, 2, 3]}

scrape the entire table with nested loops and dot-notation:

>>> scrape("""
        <table>
        <tr />
        {*
                <tr>
                <td>{{ [days].number|int }}</td>
                {*
                        <td>{{ [days].[points]|float }}</td>
                *}
                </tr>
        *}
        </table>
        """,
        html)

>>> {'days': [{'number': 1, 'points': [5.6, 24.5]},
              {'number': 2, 'points': [1.1, 12.8]},
              {'number': 3, 'points': [2.4, 5.67]}]}

preserve HTML when you scrape:

>>> scrape("""
        <div id='footer'>{{ footer|html }}</div>
        """,
        html)

>>> {'footer': "<a href='disclaimer.html'>Disclaimer</a> |
                                    <a href='contact.html'>Contact</a>"}

visit another page and scrape it:

>>> scrape("""
        <head>
        <meta http-equiv='refresh' content='url={@

                <title>{{ title }}</title>

        @}'/>
        </head>
        """,
        html)

>>> {'title': 'whatever the title of http://otherurl.com/ is'}