<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>The GITS Blog &#187; programming</title>
	<atom:link href="http://www.ginstrom.com/scribbles/category/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<pubDate>Mon, 25 Aug 2008 05:54:18 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
	<language>en</language>
			<item>
		<title>Choices</title>
		<link>http://ginstrom.com/scribbles/2008/08/25/choices/</link>
		<comments>http://ginstrom.com/scribbles/2008/08/25/choices/#comments</comments>
		<pubDate>Mon, 25 Aug 2008 05:54:18 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[silly]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/?p=128</guid>
		<description><![CDATA[Here's a not-serious field guide for choosing a scripting language.
Perl
Choose perl if you think that scripting is for quick-n-dirty tasks, and your language ought to look like it. Or if you like how it looks when Snoopy swears.
Ruby
Choose ruby if you think that perl is pretty cool, but it should be even more powerful.
Python
Choose python [...]]]></description>
			<content:encoded><![CDATA[<p>Here's a not-serious field guide for choosing a scripting language.</p>
<h3>Perl</h3>
<p>Choose <a href="http://www.perl.org/">perl</a> if you think that scripting is for quick-n-dirty tasks, and your language ought to look like it. Or if you like how it looks when Snoopy swears.</p>
<h3>Ruby</h3>
<p>Choose <a href="http://www.ruby-lang.org/">ruby</a> if you think that perl is pretty cool, but it should be <a href="http://en.wikipedia.org/wiki/Ruby_programming_language#History">even more powerful</a>.</p>
<h3>Python</h3>
<p>Choose <a href="http://www.python.org/">python</a> if you think naming your throwaway variables "spam" and "egg" is really witty. Or if you want to write really cool stuff. (No bias here, no sir!)</p>
<h3>JavaScript</h3>
<p>Choose <a href="http://en.wikipedia.org/wiki/JavaScript">javascript</a> if you're a cowboy hacker or a genius hacker (or both).</p>
<h3>VBScript</h3>
<p>Choose <a href="http://en.wikipedia.org/wiki/VBScript">vbscript</a> if there's no hope for you whatsoever.</p>
<h3>Lua</h3>
<p>Choose <a href="http://www.lua.org/">lua</a> if you like squeezing into small spaces.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/08/25/choices/feed/</wfw:commentRss>
		</item>
		<item>
		<title>WxPython 2.8.8.0 quietly introduces true ActiveX hosting for Windows</title>
		<link>http://ginstrom.com/scribbles/2008/08/20/wxpython-2880-quietly-introduces-true-activex-hosting-for-windows/</link>
		<comments>http://ginstrom.com/scribbles/2008/08/20/wxpython-2880-quietly-introduces-true-activex-hosting-for-windows/#comments</comments>
		<pubDate>Wed, 20 Aug 2008 10:46:49 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/08/20/wxpython-2880-quietly-introduces-true-activex-hosting-for-windows/</guid>
		<description><![CDATA[Version 2.8.8.0 of WxPython uses the new activex class to host ActiveX controls on Windows. This means that unlike previous implementations of the wrapper for the IE HTML ActiveX control, this version has full access to the browser events and DOM.
This is a huge advance for Windows GUI programming with WxPython. Now WxPython applications on [...]]]></description>
			<content:encoded><![CDATA[<p>Version 2.8.8.0 of <a href="http://www.wxpython.org/">WxPython</a> uses the new activex class to host ActiveX controls on Windows. This means that unlike previous implementations of the wrapper for the IE HTML ActiveX control, this version has full access to the browser events and DOM.</p>
<p>This is a huge advance for Windows GUI programming with WxPython. Now WxPython applications on Windows can host an IE window, and control the contents programmatically via the document property.</p>
<p>The WxPython team has been pretty quiet about it. The changes don't seem to have made it into the docs (although they're mentioned in the <a href="http://www.wxpython.org/recentchanges.php">change log</a> and the sample code has been updated).</p>
<p>Here's a simple example of a dialog box with an HTML window. The dialog intercepts clicks on the links, and uses them to set the color of the text by accessing its css property.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">#coding: UTF8</span><br />
<span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
ColorWindow</p>
<p>Demonstrates controlling DOM of IEHtmlWindow<br />
&quot;</span><span class="st0">&quot;&quot;</span></p>
<p><span class="kw1">import</span> wx<br />
<span class="kw1">from</span> wx.<span class="me1">lib</span> <span class="kw1">import</span> iewin<br />
<span class="kw1">from</span> wx.<span class="me1">lib</span> <span class="kw1">import</span> sized_controls as sc</p>
<p>HTML_DOCUMENT = u<span class="st0">&quot;&quot;</span><span class="st0">&quot;&lt;html&gt;<br />
&nbsp; &nbsp; &lt;body&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;p id=&quot;</span>text<span class="st0">&quot;&gt;Change the color of the text&lt;/p&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;p&gt;Actions:&lt;/p&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;ul&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &lt;li&gt;&lt;a href=&quot;</span>/red<span class="st0">&quot;&gt;Make it red&lt;/a&gt;&lt;/li&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &lt;li&gt;&lt;a href=&quot;</span>/blue<span class="st0">&quot;&gt;Make it blue&lt;/a&gt;&lt;/li&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &lt;li&gt;&lt;a href=&quot;</span>/green<span class="st0">&quot;&gt;Make it green&lt;/a&gt;&lt;/li&gt;<br />
&nbsp; &nbsp; &nbsp; &nbsp; &lt;/ul&gt;<br />
&nbsp; &nbsp; &lt;/body&gt;<br />
&lt;/html&gt;&quot;</span><span class="st0">&quot;&quot;</span></p>
<p><span class="kw1">class</span> ColorWindow<span class="br0">&#40;</span>sc.<span class="me1">SizedDialog</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Hosts an IEHtmlWindow, and responds to clicks on links<br />
&nbsp; &nbsp; by setting the color of the text.<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span>, parent<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">ie</span> = <span class="kw2">None</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; flag = wx.<span class="me1">DEFAULT_DIALOG_STYLE</span>|wx.<span class="me1">RESIZE_BORDER</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; sc.<span class="me1">SizedDialog</span>.<span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;parent,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="nu0">-1</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="st0">&quot;Color Window&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;style=flag,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;size=<span class="br0">&#40;</span><span class="nu0">300</span>,<span class="nu0">300</span><span class="br0">&#41;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">layout</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> layout<span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Performs the layout of GUI widgets&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; pane = <span class="kw2">self</span>.<span class="me1">GetContentsPane</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">ie</span> = iewin.<span class="me1">IEHtmlWindow</span><span class="br0">&#40;</span>pane<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">ie</span>.<span class="me1">SetSizerProps</span><span class="br0">&#40;</span>expand=<span class="kw2">True</span>, proportion=<span class="nu0">1</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">ie</span>.<span class="me1">LoadString</span><span class="br0">&#40;</span>HTML_DOCUMENT<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">ie</span>.<span class="me1">AddEventSink</span><span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">SetButtonSizer</span><span class="br0">&#40;</span><span class="kw2">self</span>.<span class="me1">CreateStdDialogButtonSizer</span><span class="br0">&#40;</span>wx.<span class="me1">OK</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; pane.<span class="me1">Fit</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> BeforeNavigate2<span class="br0">&#40;</span><span class="kw2">self</span>, this, pDisp, URL, Flags,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; TargetFrameName, PostData, Headers,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Cancel<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;<br />
&nbsp; &nbsp; &nbsp; &nbsp; This is a callback from the HTML window before<br />
&nbsp; &nbsp; &nbsp; &nbsp; navigating to a clicked link. We'll use it to set the<br />
&nbsp; &nbsp; &nbsp; &nbsp; color, then cancel.<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; color = URL<span class="br0">&#91;</span><span class="nu0">0</span><span class="br0">&#93;</span>.<span class="me1">split</span><span class="br0">&#40;</span><span class="st0">&quot;/&quot;</span><span class="br0">&#41;</span><span class="br0">&#91;</span><span class="nu0">-1</span><span class="br0">&#93;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; elem = <span class="kw2">self</span>.<span class="me1">ie</span>.<span class="me1">document</span>.<span class="me1">getElementById</span><span class="br0">&#40;</span><span class="st0">&quot;text&quot;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; elem.<span class="me1">style</span>.<span class="me1">cssText</span> = <span class="st0">&quot;color: %s;&quot;</span> % color<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># cancel it so it doesn't actually try to navigate there</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; Cancel<span class="br0">&#91;</span><span class="nu0">0</span><span class="br0">&#93;</span> = <span class="kw2">True</span></p>
<p><span class="kw1">if</span> __name__ == <span class="st0">'__main__'</span>:<br />
&nbsp; &nbsp; application = wx.<span class="me1">PySimpleApp</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; window = ColorWindow<span class="br0">&#40;</span><span class="kw2">None</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; window.<span class="me1">ShowModal</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; window.<span class="me1">Destroy</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; application.<span class="me1">MainLoop</span><span class="br0">&#40;</span><span class="br0">&#41;</span></div>
<p><a href="/code/htmlwin.zip">Here's the code (htmlwin.zip)</a>.</p>
<p>The method BeforeNavigate2 intercepts clicks on links, and uses the link information to get the desired color. Then it finds the element in the DOM with an id of "text", and sets that element's color to the link color.</p>
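<p>The link-to-color step can be tried without a GUI. The sketch below is only an illustration: the helper names, the whitelist, and the fallback color are my own assumptions, and the exact URL string that IE passes to the callback is assumed rather than taken from the code above.</p>
<pre>
```python
ALLOWED_COLORS = ("red", "blue", "green")  # the three links in the sample page


def color_from_url(url):
    """Return the last path segment of a clicked link, e.g. 'red'."""
    return url.rstrip("/").split("/")[-1]


def css_for_link(url, default="black"):
    """Build the cssText value, falling back to a default for unknown links."""
    color = color_from_url(url)
    if color not in ALLOWED_COLORS:
        color = default  # hypothetical fallback, not part of the original dialog
    return "color: %s;" % color


print(css_for_link("about:/red"))
```
</pre>
<p>Checking the segment against a short list keeps an unexpected link from writing arbitrary text into the element's style.</p>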
<p>The possibilities of this technique are huge. Although it has the disadvantage of tying you to Windows, if your application is going to be Windows-only anyway, it lets you write an application with many of the benefits of a Web app, and very few of the drawbacks.</p>
<p>Here's a screenshot of the above code in action:<br />
<img src="/img/htmlwin_screenshot.png" alt="Screenshot of the HTML dialog" /></p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/08/20/wxpython-2880-quietly-introduces-true-activex-hosting-for-windows/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Another argument against rewrites</title>
		<link>http://ginstrom.com/scribbles/2008/08/15/another-argument-against-rewrites/</link>
		<comments>http://ginstrom.com/scribbles/2008/08/15/another-argument-against-rewrites/#comments</comments>
		<pubDate>Fri, 15 Aug 2008 01:21:41 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/08/15/another-argument-against-rewrites/</guid>
		<description><![CDATA[I read an interesting post about the launch of Delicious 2.0. When Yahoo! bought del.icio.us, they immediately rewrote the entire site in a new language (two, actually). The rewrite took three years, didn't add significant functionality, broke some old functionality, and introduced a bunch of new bugs &#8212; even after three years.
As Delicious stagnated, an [...]]]></description>
			<content:encoded><![CDATA[<p>I read an <a href="http://www.cincomsmalltalk.com/blog/blogView?showComments=true&#038;printTitle=The_Wages_of_Pointless_Rewrites&#038;entry=3394981268">interesting post</a> about the launch of <a href="http://delicious.com/">Delicious 2.0</a>. When Yahoo! bought del.icio.us, they immediately rewrote the entire site in a new language (two, actually). The rewrite took three years, didn't add significant functionality, broke some old functionality, and introduced a bunch of new bugs &#8212; even after three years.</p>
<p>As Delicious stagnated, an entire ecosystem of social bookmarking sites sprang up. Delicious squandered its huge lead in this space doing a rewrite, while it could have used that energy to keep innovating ahead of the me-too sites.</p>
<p>Let this be yet another lesson in avoiding <a href="http://chadfowler.com/2006/12/27/the-big-rewrite">the big rewrite</a>. Even if the codebase you inherit is really bad, if you've got a working system (which del.icio.us most certainly was), it's best to keep the existing system and refactor.</p>
<p>The rewrite is a seductively attractive solution. Just looking at the code makes your brain hurt? Dump it and start over! I've gone through my share of rewrites. I re-wrote my application <a href="http://felix-cat.com/">Felix</a> from the ground up twice. The first rewrite was needed &#8212; I wrote the first version while simultaneously learning C++ and Win32 programming &#8212; but the second could and should have been avoided.</p>
<p>I'll use the Delicious debacle as another reminder that no matter how attractive the rewrite option seems, it's almost always a bad idea.</p>
<p>Incidentally, <a href="http://www.reddit.com/r/programming/comments/6v4yw/how_yahoo_dropped_the_delicious_ball_with_a/c04y9tn">it's been pointed out</a> that Yahoo! rewrote the original Perl codebase in PHP and C++ (!), because those are its "institutional" languages. It sounds like Yahoo!'s inflexibility caused it to make a huge blunder.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/08/15/another-argument-against-rewrites/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Using custom functions with SQLAlchemy and SQLite</title>
		<link>http://ginstrom.com/scribbles/2008/08/10/using-custom-functions-with-sqlalchemy-and-sqlite/</link>
		<comments>http://ginstrom.com/scribbles/2008/08/10/using-custom-functions-with-sqlalchemy-and-sqlite/#comments</comments>
		<pubDate>Sat, 09 Aug 2008 14:43:10 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/08/10/using-custom-functions-with-sqlalchemy-and-sqlite/</guid>
		<description><![CDATA[I recently developed a Web-based translation memory (TM) application in Python. One thing the application does is fuzzy glossary matching: given a source sentence, it'll find all terms in the glossary that are fuzzy substrings of that sentence (using my fuzzy substring matching module, which is based on the Levenshtein distance algorithm), and return the [...]]]></description>
			<content:encoded><![CDATA[<p>I recently developed a <a href="http://felix-cat.com/tools/memory-serves/">Web-based translation memory (TM) application</a> in Python. One thing the application does is fuzzy glossary matching: given a source sentence, it'll find all terms in the glossary that are fuzzy substrings of that sentence (using my <a href="http://pypi.python.org/pypi/subdist/0.2.1">fuzzy substring matching module</a>, which is based on the <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> algorithm), and return the terms along with their translations.</p>
<p>Here's how I created a custom function for fuzzy glossary searches, using <a href="http://www.sqlalchemy.org/">SQLAlchemy</a> for the ORM, with <a href="http://www.sqlite.org/">SQLite</a> as the database engine. Assuming you've got your <a href="http://www.sqlalchemy.org/docs/04/session.html">SessionClass object</a>, create a session, and get a connection object:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">import</span> subdist</p>
<p><span class="kw1">def</span> make_gloss_func<span class="br0">&#40;</span>haystack<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Creates a fuzzy substring matcher using haystack<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; get_score = subdist.<span class="me1">get_score</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> gloss_func<span class="br0">&#40;</span>needle<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> get_score<span class="br0">&#40;</span>needle, haystack<span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">return</span> gloss_func</p>
<p><span class="kw1">class</span> TM<span class="br0">&#40;</span><span class="kw2">object</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Represents a translation memory (TM)/glossary&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="co1"># stuff skipped&#8230;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> gloss_search<span class="br0">&#40;</span><span class="kw2">self</span>, query, minscore<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Do a fuzzy glossary search.<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; session = <span class="kw2">self</span>.<span class="me1">SessionClass</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Create the custom function</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gloss_func = make_gloss_func<span class="br0">&#40;</span>query<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; conn = session.<span class="me1">bind</span>.<span class="me1">connect</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; conn.<span class="me1">connection</span>.<span class="me1">create_function</span><span class="br0">&#40;</span><span class="st0">&quot;gloss_score&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="nu0">1</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; gloss_func<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Execute the query</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; search_string = <span class="st0">&quot;&quot;</span><span class="st0">&quot;SELECT * FROM records<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; WHERE gloss_score(source)&gt;=:minscore&quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> conn.<span class="me1">execute</span><span class="br0">&#40;</span>search_string,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">dict</span><span class="br0">&#40;</span>minscore=minscore<span class="br0">&#41;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">finally</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">SessionClass</span>.<span class="me1">remove</span><span class="br0">&#40;</span><span class="br0">&#41;</span></div>
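<p>The same closure trick can be tried with nothing but the standard library. The sketch below makes several assumptions: it uses the sqlite3 module directly instead of a SQLAlchemy session, the <code>records</code> table has hypothetical <code>source</code> and <code>trans</code> columns, and a plain containment check stands in for subdist's scorer.</p>
<pre>
```python
import sqlite3


def make_gloss_func(haystack):
    """Bind the query sentence so SQLite can call a one-argument function."""
    def gloss_func(needle):
        # Stand-in scorer (an assumption, not subdist): 1.0 when the term
        # appears verbatim in the sentence, otherwise 0.0.
        return 1.0 if needle in haystack else 0.0
    return gloss_func


def gloss_search(conn, query, minscore):
    """Return (source, trans) rows whose source scores at least minscore."""
    # Registering under an existing name replaces the previous function,
    # so each search gets a scorer bound to its own query sentence.
    conn.create_function("gloss_score", 1, make_gloss_func(query))
    sql = ("SELECT source, trans FROM records "
           "WHERE gloss_score(source) >= :minscore ORDER BY source")
    return conn.execute(sql, {"minscore": minscore}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (source TEXT, trans TEXT)")
conn.executemany("INSERT INTO records VALUES (?, ?)",
                 [("memory", "memori"), ("glossary", "yogoshu"), ("server", "saba")])
matches = gloss_search(conn, "share the glossary in memory", 0.5)
print(matches)
```
</pre>
<p>SQLite calls the registered function once per row with the <code>source</code> value, which is why the query sentence has to be captured in the closure rather than passed as a second SQL argument.</p>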
<h3>Speedup</h3>
<p>Just for fun, I compared the speed of (1) using custom functions in SQLite with (2) keeping the records in a Python list, and getting the matches in pure Python using a list comprehension. I found that the SQLAlchemy version is about 8 times faster. In the test, I created a glossary of 44,732 records using random word pairs, and got the fuzzy substrings for a query sentence.</p>
<table>
<tr>
<th>version</th>
<th>time</th>
</tr>
<tr>
<td>native Python</td>
<td>0.7837 s</td>
</tr>
<tr>
<td>SQLAlchemy</td>
<td>0.0966 s</td>
</tr>
</table>
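<p>The pure-Python side of a comparison like this can be sketched as follows. This is not the linked test code: the record data, the stand-in containment scorer, and the use of <code>timeit</code> are all assumptions, so the timings it prints say nothing about the figures in the table.</p>
<pre>
```python
import timeit


def score(needle, haystack):
    # Stand-in scorer (an assumption): exact containment, not fuzzy matching.
    return 1.0 if needle in haystack else 0.0


def python_search(records, query, minscore):
    """Filter (source, trans) pairs in pure Python with a list comprehension."""
    return [(source, trans) for source, trans in records
            if score(source, query) >= minscore]


# Hypothetical glossary of 1,000 generated term pairs.
records = [("term%d" % i, "trans%d" % i) for i in range(1000)]
query = "a sentence mentioning term42 and term7"

hits = python_search(records, query, 0.5)
elapsed = timeit.timeit(lambda: python_search(records, query, 0.5), number=10)
print(len(hits), "hits;", elapsed / 10, "seconds per search")
```
</pre>
<p>Timing the SQLite version with the same <code>timeit</code> call, on the same records, keeps the two measurements comparable.</p>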
<p>Since the fuzzy-matching code and database code are written in C, the SQLAlchemy version probably runs at close to C speed, with the only slowdown being the overhead of calling them from Python (which is pretty minimal; most of the work is done elsewhere).</p>
<p>More importantly, the SQLAlchemy version easily meets my performance target of a 50,000-record search in 0.25 seconds, while the native Python version falls pretty far short.</p>
<p>Also interestingly, I found that <a href="http://psyco.sourceforge.net/">psyco</a> didn't speed up either version at all, and in fact made both slightly slower. Another demonstration that you should profile rather than applying psyco as a panacea.</p>
<p>Here's the <a href="/code/speed_test.gz">code used for the test</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/08/10/using-custom-functions-with-sqlalchemy-and-sqlite/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Book Review: CherryPy Essentials</title>
		<link>http://ginstrom.com/scribbles/2008/08/05/book-review-cherrypy-essentials/</link>
		<comments>http://ginstrom.com/scribbles/2008/08/05/book-review-cherrypy-essentials/#comments</comments>
		<pubDate>Tue, 05 Aug 2008 05:12:39 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/08/05/book-review-cherrypy-essentials/</guid>
		<description><![CDATA[


I recently created a server application to share Felix memories/glossaries over a local network. After a simple test application, I was confident that CherryPy would serve my needs, so I bought CherryPy Essentials (Sylvain Hellegouarch) and started hacking.

The story of a website
The book uses a practical-minded, show-me-the-code style that I personally like. The book tells [...]]]></description>
			<content:encoded><![CDATA[<div style="float:left; margin-right: 10px; margin-bottom: 10px">
<a href="http://www.cherrypyessentials.com/" style="border:none;" title="CherryPy Essentials website"><img src="/img/cherrypyessentialscover.jpg" alt="CherryPy Essentials book cover" border="0" /></a>
</div>
<p>I recently created a <a href="http://felix-cat.com/tools/memory-serves/">server application</a> to share <a href="http://felix-cat.com/">Felix</a> memories/glossaries over a local network. After a <a href="http://felix-cat.com/tools/wordcount/">simple test application</a>, I was confident that <a href="http://www.cherrypy.org/">CherryPy</a> would serve my needs, so I bought <a href="http://www.cherrypyessentials.com/">CherryPy Essentials</a> (Sylvain Hellegouarch) and started hacking.</p>
<p><br clear="all" /></p>
<h3>The story of a website</h3>
<p>The book uses a practical-minded, show-me-the-code style that I personally like. The book tells the story of a website, starting with an introduction to CherryPy, then walking us through the creation of a sophisticated Web 2.0 website, adding more elements as our understanding grows.</p>
<p>While I found the style of taking us through the development of a website useful, this approach has an obvious weakness: if you don't use the same components as the author, the book will be less relevant. For example, the author chooses <a href="http://www.aminus.net/dejavu">Dejavu</a> for his <a href="http://en.wikipedia.org/wiki/Object-relational_mapping">ORM</a>, while I prefer <a href="http://www.sqlalchemy.org/">SQLAlchemy</a>; he chooses <a href="http://www.kid-templating.org/">Kid</a> for his templating engine, while I prefer <a href="http://www.makotemplates.org/">Mako</a>; and he chooses <a href="http://mochikit.com/">MochiKit</a> for his JavaScript library, while I prefer <a href="http://jquery.com/">jQuery</a>.</p>
<p>While this prevented me from using the book's code whole cloth, it did show me how to build a website the CherryPy way, using CherryPy's fantastic features and integrating them with the other technologies used in a modern Web application. In all I'm quite pleased with the value I got out of the book.</p>
<h3>Contents</h3>
<p>Chapter 1: Introduction to CherryPy<br />
Chapter 2: Download and Install CherryPy<br />
Chapter 3: Overview of CherryPy<br />
Chapter 4: CherryPy in Depth<br />
Chapter 5: A Photoblog Application<br />
Chapter 6: Web Services<br />
Chapter 7: The Presentation Layer<br />
Chapter 8: Ajax<br />
Chapter 9: Testing<br />
Chapter 10: Deployment</p>
<p><strong>Chapter 1: Introduction to CherryPy</strong></p>
<p>This chapter provides some introduction to CherryPy's history and community. It can be safely skipped by the impatient.</p>
<p><strong>Chapter 2: Download and Install CherryPy</strong></p>
<p>Not too much of interest here either. If you can find the book, I'm sure you can find CherryPy.</p>
<p><strong>Chapter 3: Overview of CherryPy</strong></p>
<p>Here's where things start getting interesting. We start with a simple "shout" style application, then walk through the anatomy of a basic CherryPy application: configuration, static files, URL routing, and page handlers.</p>
<p>The author also provides a list of the built-in library modules, which I found to be fairly useless since he doesn't tell us how to use them or even really much about what they might be good for.</p>
<p><strong>Chapter 4: CherryPy in Depth</strong></p>
<p>True to its title, this chapter talks in depth about the various capabilities and tools provided with CherryPy, including material on extending and hooking the CherryPy engine. If you've got experience developing websites, chapters 3 and 4 are really all you need to get cracking on a cool website with all the fixings.</p>
<p><strong>Chapter 5: A Photoblog Application</strong></p>
<p>For the remainder of the book, we'll be building a photo blogging application, fully Web 2.0 buzzword compliant with web services, JavaScript, and AJAX goodness.</p>
<p>The chapter provides a high-level outline of the application we'll be building, and then discusses some of the various ORMs available for Python.</p>
<p><strong>Chapter 6: Web Services</strong></p>
<p>This cool chapter describes how to build Web services (including RESTful services) using CherryPy. This served as a major inspiration for creating the API to my server, although I decided not to use REST.</p>
<p><strong>Chapter 7: The Presentation Layer</strong></p>
<p>This chapter goes over templating libraries for Python (the only thing more numerous in Python than templating libraries are Web frameworks) and how to integrate them with CherryPy, then settles on Kid as the engine to use for the application.</p>
<p><strong>Chapter 8: Ajax</strong></p>
<p>Although the author uses MochiKit as his JavaScript library, he describes the AJAX communication process at a fairly low level, so it was easy to transfer the concepts over to jQuery. After briefly describing JSON as the mechanism for transferring data between JavaScript and Python, the chapter goes on to describe how we'll be using AJAX in our photo blogging application.</p>
<p><strong>Chapter 9: Testing</strong></p>
<p>You've got to give props to a framework book that devotes a chapter to testing. As with other components, although I prefer <a href="http://www.somethingaboutorange.com/mrl/projects/nose/">nose</a> and <a href="http://wwwsearch.sourceforge.net/mechanize/">mechanize</a> for unit/web testing to the author's <a href="http://docs.python.org/lib/module-unittest.html">unittest</a> and <a href="http://pythonpaste.org/webtest/">webtest</a>, the concepts were what mattered and I had no trouble following along.</p>
<p>I also appreciated the section on performance/load testing using <a href="http://funkload.nuxeo.org/">FunkLoad</a>, as well as functional testing with <a href="http://selenium.openqa.org/">Selenium</a>, which I wasn't very familiar with.</p>
<p><strong>Chapter 10: Deployment</strong></p>
<p>This chapter is also pure gold in a framework book. It discusses the various deployment and configuration options, including deploying behind Apache and lighttpd. I found the section on supporting SSL to be particularly helpful, as my next step for my server will be hosting it myself via SSL.</p>
<h3>Conclusion</h3>
<p>This book is a great practical guide to building websites with CherryPy, using best practices for the framework and getting the most out of the huge array of functionality that it provides.</p>
<h3>The Packt Publishing Ebook</h3>
<p>I bought this book as a PDF from <a href="http://www.packtpub.com/">Packt Publishing</a>. I prefer having programming books in electronic format, because I typically read them at my computer while experimenting with the code in the book. And since I live in Japan, the electronic format is also a great way to get my grubby hands on technical books quickly and cheaply.</p>
<p>Packt ebooks, I found, have an annoying security feature: copy and paste is disabled, thwarting my preferred tactic of copying and pasting code snippets to run, and terms to google. Sure, the code samples are provided, but that adds a lot of hassle to what should be a simple operation.</p>
<p>I wish Packt would discontinue this practice. It only hurts paying customers, since the pirates will no doubt have little trouble stripping out that protection and repackaging the book in any format they like.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/08/05/book-review-cherrypy-essentials/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Python is for people who want to program</title>
		<link>http://ginstrom.com/scribbles/2008/07/24/python-is-for-people-who-want-to-program/</link>
		<comments>http://ginstrom.com/scribbles/2008/07/24/python-is-for-people-who-want-to-program/#comments</comments>
		<pubDate>Wed, 23 Jul 2008 22:48:55 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/07/24/python-is-for-people-who-want-to-program/</guid>
		<description><![CDATA[Saw a great quote the other day on comp.lang.python, in response to a troll questioning Python's usefulness in the "real" world:
Python is for people who want to program, not REAL WORLD programmers.
By Mensanator in comp.lang.python (Google groups link)
(Python encourages a sense of fun, and people on the comp.lang.python group tend to like to have fun [...]]]></description>
			<content:encoded><![CDATA[<p>Saw a great quote the other day on <a href="http://groups.google.com/group/comp.lang.python/topics">comp.lang.python</a>, in response to a troll questioning Python's usefulness in the "real" world:</p>
<blockquote><p>Python is for people who want to program, not REAL WORLD programmers.</p></blockquote>
<div style="text-align:right">By Mensanator <a href="http://groups.google.com/group/comp.lang.python/msg/71c069f822528251">in comp.lang.python (Google groups link)</a></div>
<p>(Python encourages a sense of fun, and people on the comp.lang.python group tend to like to have fun taking the piss out of trolls.)</p>
<p>Not that "real world" programmers don't use Python &#8212; just that people who place a lot of value on being a "real world" programmer are probably using Java or C# or something. Meanwhile, people use Python because it's a great language and they love programming in it, not because it will make people think they're "real world" programmers.</p>
<p>This goes back to the <a href="http://www.paulgraham.com/pypar.html">Python Paradox</a> described by Paul Graham: Python programmers are generally good hires, not because Python is (necessarily) a better language than Java or C#, but because</p>
<blockquote><p>&#8230;people don't learn Python because it will get them a job; they learn it because they genuinely like to program and aren't satisfied with the languages they already know.</p></blockquote>
<p>This is becoming less of a sure thing as Python gains popularity and starts to look good on "real world" resumes, but I think it's still true that people use Python because they like it.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/07/24/python-is-for-people-who-want-to-program/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Stupid gender stereotyping and questionable coding practices 101</title>
		<link>http://ginstrom.com/scribbles/2008/07/06/stupid-gender-stereotyping-and-questionable-coding-practices-101/</link>
		<comments>http://ginstrom.com/scribbles/2008/07/06/stupid-gender-stereotyping-and-questionable-coding-practices-101/#comments</comments>
		<pubDate>Sun, 06 Jul 2008 07:13:24 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/07/06/stupid-gender-stereotyping-and-questionable-coding-practices-101/</guid>
		<description><![CDATA[The WSJ blog of all places has an article about how men and women code (and women do it better).
Apparently, "testosterone-fueled" men tend to write cryptic code without comments, while "touchy-feely and considerate" women tend to write lots of helpful comments and are just all-around nice people.
If you're going to make sweeping generalizations about how [...]]]></description>
			<content:encoded><![CDATA[<p>The WSJ blog of all places has an article about <a href="http://blogs.wsj.com/biztech/2008/06/06/men-write-code-from-mars-women-write-more-helpful-code-from-venus/">how men and women code (and women do it better)</a>.</p>
<p>Apparently, "testosterone-fueled" men tend to write cryptic code without comments, while "touchy-feely and considerate" women tend to write lots of helpful comments and are just all-around nice people.</p>
<p>If you're going to make sweeping generalizations about how men and women code, I want to see some real studies, not the personal observations of some random female engineer.</p>
<p>The random engineer in question is Emma McGrattan, a high-up muckety-muck about Silicon Valley and the sole source quoted in the article, who</p>
<blockquote><p>&#8230;boasts that 70% to 80% of the time, she can look at a chunk of computer code and tell if it was written by a man or a woman.</p></blockquote>
<p>But that's not very impressive. 80% of the programmers at her company are men &#8212; she could guess "man" every time and have an 80% chance of being right. The ratio of men to women is even higher in the general programmer population (in the English-speaking world, at least). So the accuracy of her spidey senses is no better than that of always guessing "man".</p>
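<p>To make the arithmetic concrete, here's a back-of-the-envelope sketch. The function name is made up for illustration; the 80% share and the 70% to 80% claim are the figures quoted above.</p>

```python
def always_guess_majority_accuracy(majority_share):
    """Accuracy you get by guessing the majority group every single time."""
    return majority_share


# Figures quoted above: 80% of the programmers are men,
# and the claimed hit rate is 70% to 80%.
baseline = always_guess_majority_accuracy(0.80)
claimed_low, claimed_high = 0.70, 0.80

# The claim is only impressive if it is strictly better than the baseline.
beats_baseline = claimed_high > baseline

print("baseline:", baseline)
print("best claimed rate beats baseline:", beats_baseline)
```

<p>Even the top of the claimed range only ties the do-nothing strategy, and the bottom of the range is worse.</p>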
<p>Even granting the premise that women add more comments to their code than men, I have my doubts about whether copious commenting is particularly helpful or even a good idea. That's why I'm not especially enthusiastic about her company's coding standards:</p>
<blockquote><p>
In an effort to make Ingres’s computer code more user-friendly and gender-neutral, McGrattan helped institute new coding standards at the company. They require programmers to include a detailed set of comments before each block of code explaining what the piece of code does and why; developers also must supply a detailed history of any changes they have made to the code. The rules apply to both Ingres employees and members of the open-source community who contribute code to Ingres’s products.</p></blockquote>
<div style="width: 50%; float:right; border: 2px solid #922; background-color; #222; font-style: italic; margin-left: 10px; margin-top:10px; margin-bottom: 10px; padding:5px;">
<strong>Spot the logical error corner</strong><br />
So, she created gender-neutral coding standards by enforcing the coding style that<br />
she thinks women prefer?
</div>
<p>A detailed history of any changes they have made? In the comments? Isn't that what source control is for? Please don't tell me that this major company has replaced its source-control system with code comments. The alternative is that this standard breaks the DRY principle (don't repeat yourself) by duplicating source-control information in the comments.</p>
<p>That's not the only problem with this standard. The big problem with comments is that they lie. Code, on the other hand, never lies. It does what it says it does (albeit sometimes cryptically). It's very common for programmers to modify the code without bothering to modify the comments; the comments, therefore, get out of date.</p>
<p>It's much better, in my opinion, to document code in a way that can never get out of date: with unit tests. I know it's a cliché, but I believe that comments should tell you the why, and unit tests should tell you the what and the how of the code.</p>
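<p>As a rough illustration of tests as documentation, here's a minimal sketch using the standard <code>unittest</code> module. The <code>normalize_whitespace</code> function is a hypothetical example, not code from Ingres or anywhere else: the single comment records the why, and the test names spell out the what.</p>

```python
import unittest


def normalize_whitespace(text):
    # Why: comparisons should not depend on how the source text
    # happened to be indented or wrapped.
    return " ".join(text.split())


class NormalizeWhitespaceTest(unittest.TestCase):
    """Each test name states one thing the function promises to do."""

    def test_collapses_runs_of_whitespace(self):
        self.assertEqual(normalize_whitespace("spam \n\t spam"), "spam spam")

    def test_strips_leading_and_trailing_whitespace(self):
        self.assertEqual(normalize_whitespace("  spam  "), "spam")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalize_whitespace(""), "")
```

<p>Saved in a module, this runs with <code>python -m unittest</code>. If somebody changes the behavior without updating the tests, the tests fail, which is more than a stale comment will ever do.</p>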
<p>At any rate, this article still gave me a warm fuzzy because the draconian standards imposed by Emma give the lie to the depressing stereotypes that the article espouses.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/07/06/stupid-gender-stereotyping-and-questionable-coding-practices-101/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Programming as craft</title>
		<link>http://ginstrom.com/scribbles/2008/07/02/programming-as-craft/</link>
		<comments>http://ginstrom.com/scribbles/2008/07/02/programming-as-craft/#comments</comments>
		<pubDate>Wed, 02 Jul 2008 06:03:47 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/07/02/programming-as-craft/</guid>
		<description><![CDATA[Paul Graham famously compared programming to painting, claiming that programming in its highest form (i.e. what he calls "hacking") is equivalent to art.
This seems to resonate with a lot of programmers, and not just because it makes us feel better about ourselves to believe that we're really creating art when we code our 83rd login [...]]]></description>
			<content:encoded><![CDATA[<p>Paul Graham <a href="http://www.paulgraham.com/hp.html">famously compared programming to painting</a>, claiming that programming in its highest form (i.e. what he calls "hacking") is equivalent to art.</p>
<p>This seems to resonate with a lot of programmers, and not just because it makes us feel better about ourselves to believe that we're really creating art when we code our 83rd login page. I think it resonates because we get an intense feeling of creative accomplishment when we send our programs out into the world.</p>
<p>My mother works at an art institute in San Francisco. Once in a while the students there will sell some art at a gallery, or to my mom; when they make a sale, they look a lot like how I feel when someone buys <a href="http://felix-cat.com/">my software</a> or <a href="http://www.swet.jp/index.php/weblog/meet_felix_a_clever_new_cat_tool_made_by_gits/">says nice things about it</a>.</p>
<p>But my software isn't art:</p>
<blockquote><p>Art refers to a diverse range of human activities, creations, and expressions that are appealing or attractive to the senses or have some significance to the mind of an individual. </p></blockquote>
<p style="text-align: right"><a href="http://en.wikipedia.org/wiki/Art">Wikipedia</a></p>
<p>And that's not the purpose of my software. Just about all the software I write is intended to make things easier for translators and other people who work with text. So while I still think that what I do is creative, its main purpose is practical.</p>
<p>Some software is art, like games and such. Making those kinds of programs is making art. But that business data entry form you're coding isn't art, sorry (unless of course you're really into <a href="http://en.wikipedia.org/wiki/Franz_Kafka">Kafka</a>).</p>
<p>Indeed, most programming is less an artistic effort and more one of craftsmanship. I think the reason more programmers don't make this connection between programming and craftsmanship is that in the developed world, the "crafts" have become so industrialized that they don't resemble what we do at all. While programming is a high-tech industry, it's definitely preindustrial. Programming isn't like 20 tables rolling off the assembly line every hour; it's more like you go into the cobbler's, and you can either get a pair of shoes that's a bit too small, or one that's a bit too large, or pay an exorbitant amount and wait a week for a custom pair to be made, which probably won't fit perfectly anyway. Heck, we programmers still have to make many of our own tools! Not very industrialized.</p>
<p>I once lived in central Ohio. In the next town over from us, there was an Amish furniture shop, where they made most of their furniture on the premises. I loved looking at and trying out the hand-made rocking chairs they had. I try to emulate those furniture makers in my programming: paying great attention to function, but making my programs pleasing to use as well. I think it's a more useful target to shoot for than <a href="http://en.wikipedia.org/wiki/Picasso">Picasso</a>.</p>
<p>After all, what would be the point of a cubist word processor?</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/07/02/programming-as-craft/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The mysterious &#8220;ImportError: cannot import name cache&#8221;</title>
		<link>http://ginstrom.com/scribbles/2008/06/17/the-mysterious-importerror-cannot-import-name-cache/</link>
		<comments>http://ginstrom.com/scribbles/2008/06/17/the-mysterious-importerror-cannot-import-name-cache/#comments</comments>
		<pubDate>Tue, 17 Jun 2008 10:47:03 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/06/17/the-mysterious-importerror-cannot-import-name-cache/</guid>
		<description><![CDATA[The scenario: I'm packaging a CherryPy server with py2exe, using Mako as my template engine. So I create my exe file, and fire up the app, and get this 500 Mako error:

Traceback (most recent call last):
&#160;&#160;File "cherrypy\_cprequest.pyo", line 551, in respond
&#160;&#160;File "cherrypy\_cpdispatch.pyo", line 24, in __call__
&#160;&#160;File "main.py", line 210, in index
&#160;&#160;File "mako\lookup.pyo", line 70, in [...]]]></description>
			<content:encoded><![CDATA[<p>The scenario: I'm packaging a <a href="http://www.cherrypy.org/">CherryPy</a> server with <a href="http://www.py2exe.org/">py2exe</a>, using <a href="http://www.makotemplates.org/">Mako</a> as my template engine. So I create my exe file, and fire up the app, and get this 500 Mako error:<br />
<span style="color:red"><br />
Traceback (most recent call last):<br />
&nbsp;&nbsp;File "cherrypy\_cprequest.pyo", line 551, in respond<br />
&nbsp;&nbsp;File "cherrypy\_cpdispatch.pyo", line 24, in __call__<br />
&nbsp;&nbsp;File "main.py", line 210, in index<br />
&nbsp;&nbsp;File "mako\lookup.pyo", line 70, in get_template<br />
&nbsp;&nbsp;File "mako\lookup.pyo", line 112, in __load<br />
&nbsp;&nbsp;File "mako\template.pyo", line 74, in __init__<br />
&nbsp;&nbsp;File "&#8230;", line 1, in &lt;module&gt;<br />
ImportError: cannot import name cache</span></p>
<p>Since the app was working fine without py2exe, I brilliantly deduced that maybe something was getting broken in the packaging process.</p>
<p>It turns out that I need to include "mako.cache" in my packages, <a href="http://koobmeej.blogspot.com/2008/03/cherrypy-mako-and-py2exe.html">as pointed out by this blog post</a>.</p>
<p>So the relevant section of my setup dictionary now looks like this:</p>
<div class="dean_ch" style="white-space: wrap;">
&nbsp; &nbsp; excludes = <span class="br0">&#91;</span><span class="st0">&quot;pywin&quot;</span>, <span class="st0">&quot;pywin.debugger&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;pywin.debugger.dbgcon&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;pywin.dialogs&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;pywin.dialogs.list&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;win32com.server&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;Tkinter&quot;</span><span class="br0">&#93;</span><br />
&nbsp; &nbsp; packages=<span class="br0">&#91;</span><span class="st0">&quot;email&quot;</span>, <span class="st0">&quot;lxml&quot;</span>, <span class="st0">&quot;mako.cache&quot;</span><span class="br0">&#93;</span></p>
<p>&nbsp; &nbsp; options = <span class="kw2">dict</span><span class="br0">&#40;</span>optimize=<span class="nu0">2</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;dist_dir=comp_name,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;excludes=excludes,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;packages=packages<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; setup_dict<span class="br0">&#91;</span><span class="st0">'options'</span><span class="br0">&#93;</span> = <span class="br0">&#123;</span><span class="st0">&quot;py2exe&quot;</span>:options<span class="br0">&#125;</span></div>
<p>Everything's working now &#8212; lovely stuff!</p>
<p>Googling on this error failed to produce any hits, so I had to (<em>gasp</em>!) do some actual research to figure this out. Here's hoping that I can save the next poor soul from that horror.</p>
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/06/17/the-mysterious-importerror-cannot-import-name-cache/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Counting words (etc.) in an HTML file with Python</title>
		<link>http://ginstrom.com/scribbles/2008/05/17/counting-words-etc-in-an-html-file-with-python/</link>
		<comments>http://ginstrom.com/scribbles/2008/05/17/counting-words-etc-in-an-html-file-with-python/#comments</comments>
		<pubDate>Sat, 17 May 2008 00:50:38 +0000</pubDate>
		<dc:creator>Ryan Ginstrom</dc:creator>
		
		<category><![CDATA[programming]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/05/17/counting-words-etc-in-an-html-file-with-python/</guid>
		<description><![CDATA[In a previous post, I wrote about how to count words, characters, and Asian characters using python.
In this post I want to pull that together with code to get a word count from an HTML file.
What needs counting
What needs counting depends to some extent on what you need the word count for, but here I'm [...]]]></description>
			<content:encoded><![CDATA[<p>In a previous post, I wrote about <a href="/scribbles/2007/10/06/counting-words-characters-and-asian-characters-with-python/">how to count words, characters, and Asian characters using python</a>.</p>
<p>In this post I want to pull that together with code to get a word count from an HTML file.</p>
<h2>What needs counting</h2>
<p>What needs counting depends to some extent on what you need the word count for, but here I'm going to be assuming that the word count is going to be used to count billable/localizable content.</p>
<p>In that scenario, you've got to count the text in the title tag, as well as the visible text in the body, and certain other localizable content: <code>img</code> <code>alt</code> attributes, <code>a</code> <code>title</code> attributes, and <code>input</code> <code>value</code> attributes (am I missing any?).</p>
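<p>To show what that list means in practice, here's a minimal sketch that uses only the standard library's <code>html.parser</code> (not the Beautiful Soup approach used in the code further down). The class name, the sample markup, and the decision to skip the contents of <code>script</code> and <code>style</code> elements are my assumptions for illustration.</p>

```python
from html.parser import HTMLParser

# (tag, attribute) pairs treated as localizable, per the list above
LOCALIZABLE_ATTRS = {("img", "alt"), ("a", "title"), ("input", "value")}


class LocalizableText(HTMLParser):
    """Collect title text, visible text, and localizable attribute values."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # non-zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if (tag, name) in LOCALIZABLE_ATTRS and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


sample = """<html><head><title>Shop</title></head>
<body><p>Buy <a href="/x" title="Go to cart">now</a></p>
<img src="a.png" alt="A red shoe">
<input type="submit" value="Send">
<script>var hidden = "not counted";</script></body></html>"""

parser = LocalizableText()
parser.feed(sample)
parser.close()
print(parser.chunks)
```

<p>Each item in <code>chunks</code> is a piece of text that would be handed to the counting code; the script contents never make it into the list.</p>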
<h2>The Code</h2>
<p>The code for counting the actual text is in the above link. Here we need code to extract the text from the HTML file, and to accumulate the counts for all the chunks we've extracted.</p>
<p>Here's the Segment class for accumulating counts:</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="kw1">class</span> Segment<span class="br0">&#40;</span><span class="kw2">object</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Represents a text segment.<br />
&nbsp; &nbsp; (For bookkeeping)<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span>, text=<span class="st0">&quot;&quot;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot; text is the segment of text we will calculate.<br />
&nbsp; &nbsp; &nbsp; &nbsp; Leave it empty if this will be a master count for a document</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; @param text: The text of the segment<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">characters</span> = <span class="kw2">len</span><span class="br0">&#40;</span>text<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; num_spaces = <span class="kw2">len</span><span class="br0">&#40;</span><span class="br0">&#91;</span>x <span class="kw1">for</span> x <span class="kw1">in</span> text <span class="kw1">if</span> x.<span class="me1">isspace</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#93;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">chars_no_spaces</span> = <span class="kw2">self</span>.<span class="me1">characters</span> - num_spaces</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">asian_chars</span> = <span class="kw2">len</span><span class="br0">&#40;</span><span class="br0">&#91;</span>x <span class="kw1">for</span> x <span class="kw1">in</span> text <span class="kw1">if</span> is_asian<span class="br0">&#40;</span>x<span class="br0">&#41;</span><span class="br0">&#93;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">non_asian_words</span> = non_j_len<span class="br0">&#40;</span>text<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">words</span> = <span class="kw2">self</span>.<span class="me1">non_asian_words</span> + <span class="kw2">self</span>.<span class="me1">asian_chars</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> accumulate<span class="br0">&#40;</span><span class="kw2">self</span>, seg<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Add the stats from &lt;seg&gt; to this one.<br />
&nbsp; &nbsp; &nbsp; &nbsp; Use this to keep a count for the entire document;<br />
&nbsp; &nbsp; &nbsp; &nbsp; use another for the whole batch of documents</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; @param seg: The segment to accumulate</p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; &gt;&gt;&gt; seg = Segment(u&quot;</span><span class="st0">&quot;)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &gt;&gt;&gt; seg2 = Segment(u&quot;</span>abc<span class="st0">&quot;)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &gt;&gt;&gt; seg.accumulate(seg2)<br />
&nbsp; &nbsp; &nbsp; &nbsp; &gt;&gt;&gt; seg.words<br />
&nbsp; &nbsp; &nbsp; &nbsp; 1<br />
&nbsp; &nbsp; &nbsp; &nbsp; &gt;&gt;&gt; seg.characters<br />
&nbsp; &nbsp; &nbsp; &nbsp; 3<br />
&nbsp; &nbsp; &nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">words</span> += seg.<span class="me1">words</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">characters</span> += seg.<span class="me1">characters</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">chars_no_spaces</span> += seg.<span class="me1">chars_no_spaces</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">asian_chars</span> += seg.<span class="me1">asian_chars</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">non_asian_words</span> += seg.<span class="me1">non_asian_words</span></div>
<p>Next, the code for extracting (segmenting) the text from an HTML file. For this, you'll need <a href="http://www.crummy.com/software/BeautifulSoup/">the excellent Beautiful Soup module</a>.</p>
<div class="dean_ch" style="white-space: wrap;">
<span class="co1">#coding: UTF8</span><br />
<span class="st0">&quot;&quot;</span><span class="st0">&quot;Html segmenter&quot;</span><span class="st0">&quot;&quot;</span></p>
<p><span class="kw1">from</span> BeautifulSoup <span class="kw1">import</span> BeautifulSoup as bsoup<br />
<span class="kw1">from</span> BeautifulSoup <span class="kw1">import</span> BeautifulStoneSoup<br />
<span class="kw1">import</span> <span class="kw3">re</span></p>
<p><span class="kw1">def</span> normalize<span class="br0">&#40;</span>text<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Normalize whitespace in C{text}.</p>
<p>&nbsp; &nbsp; &gt;&gt;&gt; normalize(u&quot;</span> &nbsp; spam\\n\\tspam &nbsp; SPAM<span class="st0">&quot;)<br />
&nbsp; &nbsp; u'spam spam SPAM'<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> u<span class="st0">' '</span>.<span class="me1">join</span><span class="br0">&#40;</span>text.<span class="me1">split</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span></p>
<p><span class="kw1">class</span> Segmenter<span class="br0">&#40;</span><span class="kw2">object</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Html segmenter<br />
&nbsp; &nbsp; Retrieves the editable/translatable text from an HTML document.<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__init__</span><span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Set up various regular expressions for splitting the text&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">pre_parse_stripper</span> = <span class="kw3">re</span>.<span class="kw2">compile</span><span class="br0">&#40;</span>u<span class="st0">&quot;|&quot;</span>.<span class="me1">join</span><span class="br0">&#40;</span><span class="br0">&#91;</span>u<span class="st0">&quot;&lt;body*?&gt;|&lt;/body&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;a[<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;|&lt;/a&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;img[<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;|&lt;/img&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;input[<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;|&lt;/input&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;script*?&gt;[<span class="es0">\s</span><span class="es0">\S</span>]*?&lt;/script&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;form[<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;|&lt;/form&gt;&quot;</span><span class="br0">&#93;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="kw3">re</span>.<span class="me1">I</span> | <span class="kw3">re</span>.<span class="me1">M</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Strip out unsightly tags before heading to the splitter&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">splitter</span> = <span class="kw3">re</span>.<span class="kw2">compile</span><span class="br0">&#40;</span>u<span class="st0">'|'</span>.<span class="me1">join</span><span class="br0">&#40;</span><span class="br0">&#91;</span>u<span class="st0">&quot;&lt;p*?&gt;|&lt;/p&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;div*?&gt;|&lt;/div&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;td*?&gt;|&lt;/td&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;li*?&gt;|&lt;/li&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;h<span class="es0">\d</span>*?&gt;|&lt;/h<span class="es0">\d</span>&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;dd*?&gt;|&lt;/dd&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;dt*?&gt;|&lt;/dt&gt;&quot;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;u<span class="st0">&quot;&lt;br*?&gt;&quot;</span><span class="br0">&#93;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="kw3">re</span>.<span class="me1">I</span> | <span class="kw3">re</span>.<span class="me1">M</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Split segments by certain tags (removing tags in bargain)<br />
&nbsp; &nbsp; &nbsp; &nbsp; These tags indicate a segment boundary&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">charset_finder</span> = <span class="kw3">re</span>.<span class="kw2">compile</span><span class="br0">&#40;</span>u<span class="st0">'[<span class="es0">\s</span><span class="es0">\S</span>]*&lt;meta[<span class="es0">\s</span><span class="es0">\S</span>]*?charset<span class="es0">\s</span>*=<span class="es0">\s</span>*([<span class="es0">\S</span>]+)&quot;[<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;[<span class="es0">\s</span><span class="es0">\S</span>]*'</span>, <span class="kw3">re</span>.<span class="me1">I</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Find the charset if necessary&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">soup</span> = <span class="kw2">None</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> <span class="kw4">__str__</span><span class="br0">&#40;</span><span class="kw2">self</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;So we can tell which segger we have (assuming multiple segmenter classes)&quot;</span><span class="st0">&quot;&quot;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="st0">&quot;HTML&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> get_chunks<span class="br0">&#40;</span><span class="kw2">self</span>, html_text<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Extract the text from the HTML file&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">soup</span> = bsoup<span class="br0">&#40;</span>html_text, fromEncoding=<span class="kw2">self</span>.<span class="me1">getEncoding</span><span class="br0">&#40;</span>html_text<span class="br0">&#41;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># document title</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="kw2">self</span>.<span class="me1">soup</span>.<span class="me1">head</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; title = <span class="kw2">self</span>.<span class="me1">soup</span>.<span class="me1">head</span>.<span class="me1">title</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> title:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">yield</span> title.<span class="kw3">string</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># image alt attributes, anchor title attributes, input value attributes</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> tag, attr <span class="kw1">in</span> <span class="br0">&#40;</span><span class="br0">&#40;</span>u<span class="st0">&quot;img&quot;</span>, u<span class="st0">&quot;alt&quot;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#40;</span>u<span class="st0">&quot;a&quot;</span>, u<span class="st0">&quot;title&quot;</span><span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#40;</span>u<span class="st0">&quot;input&quot;</span>, u<span class="st0">&quot;value&quot;</span><span class="br0">&#41;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> <span class="kw3">chunk</span> <span class="kw1">in</span> <span class="kw2">self</span>.<span class="me1">getAttributes</span><span class="br0">&#40;</span>tag, attr<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="kw3">chunk</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">yield</span> <span class="kw3">chunk</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Parse the body text</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="kw2">self</span>.<span class="me1">soup</span>.<span class="me1">body</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; text = <span class="kw2">self</span>.<span class="me1">pre_parse_stripper</span>.<span class="me1">sub</span><span class="br0">&#40;</span>u<span class="st0">&quot;&quot;</span>, <span class="kw2">unicode</span><span class="br0">&#40;</span><span class="kw2">self</span>.<span class="me1">soup</span>.<span class="me1">body</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> <span class="kw3">chunk</span> <span class="kw1">in</span> <span class="kw2">self</span>.<span class="me1">splitter</span>.<span class="me1">split</span><span class="br0">&#40;</span>text<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; normal = normalize<span class="br0">&#40;</span>html2plain<span class="br0">&#40;</span><span class="kw3">chunk</span><span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> normal:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">yield</span> normal</p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> getAttributes<span class="br0">&#40;</span><span class="kw2">self</span>, tagName, attrName<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Get all attrName values for tagName tags&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; attrs = <span class="br0">&#91;</span><span class="br0">&#93;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; tags = <span class="kw2">self</span>.<span class="me1">soup</span>.<span class="me1">findAll</span><span class="br0">&#40;</span>tagName<span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> tag <span class="kw1">in</span> tags:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">try</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; attr = tag<span class="br0">&#91;</span>attrName<span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> attr:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; attrs.<span class="me1">append</span><span class="br0">&#40;</span>attr<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">except</span> <span class="kw2">KeyError</span>, e:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1">#print &quot;Tag %s does not have attribute %s&quot; % (tagName, attrName)</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">pass</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> attrs</p>
<p>&nbsp; &nbsp; <span class="kw1">def</span> getEncoding<span class="br0">&#40;</span><span class="kw2">self</span>, text<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Retrieve the encoding META tag, if present&quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; &nbsp; &nbsp; m = <span class="kw2">self</span>.<span class="me1">charset_finder</span>.<span class="me1">match</span><span class="br0">&#40;</span>text<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> m:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> m.<span class="me1">groups</span><span class="br0">&#40;</span><span class="nu0">0</span><span class="br0">&#41;</span><span class="br0">&#91;</span><span class="nu0">0</span><span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span> <span class="kw2">None</span></p>
<p>
TAG_STRIPPER = <span class="kw3">re</span>.<span class="kw2">compile</span><span class="br0">&#40;</span>u<span class="st0">&quot;&lt;[!<span class="es0">\w</span>/][<span class="es0">\s</span><span class="es0">\S</span>]*?&gt;&quot;</span>, <span class="kw3">re</span>.<span class="me1">I</span> | <span class="kw3">re</span>.<span class="me1">M</span><span class="br0">&#41;</span></p>
<p><span class="kw1">def</span> strip_tags<span class="br0">&#40;</span>line<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;strip the HTML tags from the line</p>
<p>&nbsp; &nbsp; &gt;&gt;&gt; strip_tags(u&quot;</span>&lt;b&gt;spam&lt;/b&gt;<span class="st0">&quot;)<br />
&nbsp; &nbsp; u'spam'</p>
<p>&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">return</span> TAG_STRIPPER.<span class="me1">sub</span><span class="br0">&#40;</span>u<span class="st0">&quot;&quot;</span>, line<span class="br0">&#41;</span></p>
<p><span class="kw1">def</span> html2plain<span class="br0">&#40;</span>text<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; <span class="st0">&quot;&quot;</span><span class="st0">&quot;Strips out tags from HTML text</p>
<p>&nbsp; &nbsp; &gt;&gt;&gt; html2plain('spam&amp;nbsp;&lt;b&gt;eggs&lt;/b&gt;')<br />
&nbsp; &nbsp; u'spam<span class="es0">\\</span>xa0eggs'<br />
&nbsp; &nbsp; &gt;&gt;&gt; html2plain('--&gt;')<br />
&nbsp; &nbsp; u'--&gt;'<br />
&nbsp; &nbsp; &quot;</span><span class="st0">&quot;&quot;</span></p>
<p>&nbsp; &nbsp; entities = BeautifulStoneSoup.<span class="me1">HTML_ENTITIES</span><br />
&nbsp; &nbsp; text = <span class="kw2">unicode</span><span class="br0">&#40;</span>BeautifulStoneSoup<span class="br0">&#40;</span>strip_tags<span class="br0">&#40;</span>text<span class="br0">&#41;</span>,<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; convertEntities=entities<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">return</span> text.<span class="me1">replace</span><span class="br0">&#40;</span>u<span class="st0">&quot;&amp;#38;gt;&quot;</span>, <span class="st0">&quot;&gt;&quot;</span><span class="br0">&#41;</span>.<span class="me1">replace</span><span class="br0">&#40;</span>u<span class="st0">&quot;&amp;#38;lt;&quot;</span>, <span class="st0">&quot;&lt;&quot;</span><span class="br0">&#41;</span></div>
<p>And here's some code to get the actual word count:</p>
<div class="dean_ch" style="white-space: wrap;">
&nbsp; &nbsp; wordcount = docstats.<span class="me1">Segment</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; segger = htmlseg.<span class="me1">Segmenter</span><span class="br0">&#40;</span><span class="br0">&#41;</span></p>
<p>&nbsp; &nbsp; <span class="kw1">for</span> <span class="kw3">chunk</span> <span class="kw1">in</span> segger.<span class="me1">get_chunks</span><span class="br0">&#40;</span><span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">&quot;thefile.html&quot;</span><span class="br0">&#41;</span>.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; wordcount.<span class="me1">accumulate</span><span class="br0">&#40;</span>docstats.<span class="me1">Segment</span><span class="br0">&#40;</span><span class="kw3">chunk</span><span class="br0">&#41;</span><span class="br0">&#41;</span></div>
<p>Here are the <a href="/code/html_wordcount.tar.gz">docstats and htmlseg modules</a>, and here is an <a href="http://felix-cat.com/tools/wordcount/">online tool using the code for the HTML word counts</a>.</p>
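<p>If you just want to try the idea without BeautifulSoup or the modules above, here is a rough sketch that uses only the Python 3 standard library. It is not the htmlseg/docstats code: <code>TextCollector</code>, <code>plain_text</code> and <code>count_words</code> are names made up for this illustration, and "a word" is simply assumed to be anything separated by whitespace.</p>
<pre>
```python
# Python 3 sketch using only the standard library; the post's own code is
# Python 2 and relies on BeautifulSoup plus the author's docstats module.
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper (not part of htmlseg): gathers text outside tags."""

    def __init__(self):
        # convert_charrefs=True decodes character references (nbsp, amp, ...)
        # before the text reaches handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.parts.append(data)


def plain_text(html_text):
    """Return the text content of html_text with the tags dropped."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


def count_words(html_text):
    """Count whitespace-separated words in the text content of html_text."""
    return len(plain_text(html_text).split())
```
</pre>
<p>For example, <code>count_words("&lt;p&gt;one &lt;i&gt;two&lt;/i&gt; three&lt;/p&gt;")</code> gives 3. Unlike <code>get_chunks</code> above, this sketch does not pick up alt, title or value attributes, does not split the text into segments, and will happily count the contents of script and style elements, so treat it as a starting point only.</p>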
]]></content:encoded>
			<wfw:commentRss>http://ginstrom.com/scribbles/2008/05/17/counting-words-etc-in-an-html-file-with-python/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
