<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: More on Screen Scraping</title>
	<atom:link href="http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/</link>
	<description>Researchin' the day away...</description>
	<pubDate>Mon, 13 Oct 2008 00:12:32 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: Yahya Cheema</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-36253</link>
		<dc:creator>Yahya Cheema</dc:creator>
		<pubDate>Mon, 29 Sep 2008 07:51:29 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-36253</guid>
		<description>Thanks Ross! I found this very informative and useful.</description>
		<content:encoded><![CDATA[<p>Thanks Ross! I found this very informative and useful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ross</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-12</link>
		<dc:creator>Ross</dc:creator>
		<pubDate>Fri, 16 Dec 2005 23:23:19 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-12</guid>
		<description>Right, Gr&#230;me recoded his scraper in Python and it ran well, so it looks like the currently crappy port of Beautiful Soup into Ruby is to blame.</description>
		<content:encoded><![CDATA[<p>Right, Gr&aelig;me recoded his scraper in Python and it ran well, so it looks like the currently crappy port of Beautiful Soup into Ruby is to blame.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ross</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-11</link>
		<dc:creator>Ross</dc:creator>
		<pubDate>Fri, 16 Dec 2005 00:47:26 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-11</guid>
		<description>I've spent the night putting together my parser, or at least a rough cut of it, for the Dublin Bus site. It's in pretty good shape now. Running on a page like the &lt;a href="http://www.dublinbus.ie/your_journey/viewer.asp?route=46a" rel="nofollow"&gt;46A schedule&lt;/a&gt;, which has a &lt;em&gt;frightening&lt;/em&gt; number of nested tables, it extracts the times from each table and collates them into a Python list in just over one second (including the page fetch). The source page is pretty atrociously coded too, so that's not the problem. I'm not familiar with Ruby, so I don't know what could be slowing your one down so much. :-S</description>
		<content:encoded><![CDATA[<p>I&#8217;ve spent the night putting together my parser, or at least a rough cut of it, for the Dublin Bus site. It&#8217;s in pretty good shape now. Running on a page like the <a href="http://www.dublinbus.ie/your_journey/viewer.asp?route=46a" rel="nofollow">46A schedule</a>, which has a <em>frightening</em> number of nested tables, it extracts the times from each table and collates them into a Python list in just over one second (including the page fetch). The source page is pretty atrociously coded too, so that&#8217;s not the problem. I&#8217;m not familiar with Ruby, so I don&#8217;t know what could be slowing your one down so much. <img src='http://www.yourhtmlsource.com/phdblog/smilies/msn_weird.png' alt='&#58;&#45;&#83;' class='wp-smiley' width='21' height='21' title='&#58;&#45;&#83;' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Graeme</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-10</link>
		<dc:creator>Graeme</dc:creator>
		<pubDate>Thu, 15 Dec 2005 17:33:22 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-10</guid>
		<description>Nice article, Ross. I looked into the Ruby version of Beautiful Soup (&lt;a href="http://www.crummy.com/software/RubyfulSoup/"&gt;Rubyful Soup&lt;/a&gt;) earlier today, but found it to suffer from large processing overheads (not far short of 60 seconds) when running it on a page with a month's worth of flight information (presumably because it builds a complete object model of the page). What kind of noticeable overheads (if any) did you experience when you were using the Python version? What size of pages were you parsing?

Back to regular expressions for the moment I think!</description>
		<content:encoded><![CDATA[<p>Nice article, Ross. I looked into the Ruby version of Beautiful Soup (<a href="http://www.crummy.com/software/RubyfulSoup/">Rubyful Soup</a>) earlier today, but found it to suffer from large processing overheads (not far short of 60 seconds) when running it on a page with a month&#8217;s worth of flight information (presumably because it builds a complete object model of the page). What kind of noticeable overheads (if any) did you experience when you were using the Python version? What size of pages were you parsing?</p>
<p>Back to regular expressions for the moment I think!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ross</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-9</link>
		<dc:creator>Ross</dc:creator>
		<pubDate>Thu, 15 Dec 2005 13:44:03 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-9</guid>
		<description>Those requirements seem ridiculously draconian, given that editors at the IMDb aren't even responsible for the majority of the content on the site. Akin to a book publisher stipulating that people may read their books only while patting their heads and saying a Hail Mary.

They do say they'll grant &lt;a href="http://www.imdb.com/help/show_article?conditions" rel="nofollow"&gt;express written consent&lt;/a&gt; for some people to use robots, if asked nicely. Google had to ask for permission to crawl the site, apparently. ^o)</description>
		<content:encoded><![CDATA[<p>Those requirements seem ridiculously draconian, given that editors at the IMDb aren&#8217;t even responsible for the majority of the content on the site. Akin to a book publisher stipulating that people may read their books only while patting their heads and saying a Hail Mary.</p>
<p>They do say they&#8217;ll grant <a href="http://www.imdb.com/help/show_article?conditions" rel="nofollow">express written consent</a> for some people to use robots, if asked nicely. Google had to ask for permission to crawl the site, apparently. <img src='http://www.yourhtmlsource.com/phdblog/smilies/msn_sarcastic.png' alt='&#94;&#111;&#41;' class='wp-smiley' width='21' height='21' title='&#94;&#111;&#41;' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aaron</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-8</link>
		<dc:creator>Aaron</dc:creator>
		<pubDate>Wed, 14 Dec 2005 23:01:09 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-8</guid>
		<description>No scraping for me: FTP all the way (maybe?) 
http://www.imdb.com/help/show_leaf?usedatasoftware

Anyone else interested in this should see if similar policies apply before blindly scraping content.</description>
		<content:encoded><![CDATA[<p>No scraping for me: <acronym title="File Transfer Protocol">FTP</acronym> all the way (maybe?)<br />
<a href="http://www.imdb.com/help/show_leaf?usedatasoftware" rel="nofollow">http://www.imdb.com/help/show_leaf?usedatasoftware</a></p>
<p>Anyone else interested in this should see if similar policies apply before blindly scraping content.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aaron</title>
		<link>http://www.yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-7</link>
		<dc:creator>Aaron</dc:creator>
		<pubDate>Wed, 14 Dec 2005 22:55:19 +0000</pubDate>
		<guid isPermaLink="false">http://yourhtmlsource.com/phdblog/2005/12/14/more-on-screen-scraping/#comment-7</guid>
		<description>Nice info, very good starting point. I personally would like to figure out if IMDb will block too many queries from the same IP. Did you ever contact the sites you scraped or just hope for the best? I don't think a site like ugc.ie will do any fancy IP logging.</description>
		<content:encoded><![CDATA[<p>Nice info, very good starting point. I personally would like to figure out if IMDb will block too many queries from the same <acronym title="Internet Protocol">IP</acronym>. Did you ever contact the sites you scraped or just hope for the best? I don&#8217;t think a site like ugc.ie will do any fancy <acronym title="Internet Protocol">IP</acronym> logging.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
