Last updated: 17 Oct 2003 19:20 UK time
Joel On Software Discussion Forum
JOS Statistics - Recent Comments
(Comments added for week ending Sun 12 Oct 2003)
Web Site Indexing/Searching Recommendations | Sun 12 Oct | steved
I want to provide a search capability:
- some parts of the site are password-protected
- may have multiple base URLs
- free is best, but can pay
- server is IIS
Looking for recommendations and tales from the field on:
- Application Service Provider-type solutions (site is indexed by a third party)
- server-based solutions (site is indexed by server-installed software)
Sun 12 Oct | Ankur | Why not use the built-in Windows Indexing Service?
Lightweight MS Office Components | Sun 12 Oct | Sam I Am
I have a web app that I'm designing that will generate Word documents with embedded Excel charts. We do not have a separate app server, and the web admin is reluctant to install the full versions of Word and Excel on the web server. Does anyone know of lightweight MS Office components that I could use instead?
Sun 12 Oct | Ori Berger | There are many toolkits that generate Excel files, but I'm not aware of any that generates .doc. You might get one of the Excel toolkits and use .rtf files instead of the .doc files. Alternatively, you can just create an HTML file with everything - both Excel and Word will happily open it (and allow the user to save it to their native formats, if she is so inclined). And, if the concern about installing office on the server is cost/licensing, you might consider installing Open Office instead; It's scriptable, and will happily read and write Office files. You can use it as a batch converter from any of its supported formats to any other. The native Open Office format is a well documented XML format, which will probably be easy for you to generate.
Sun 12 Oct | Mike Gunderloy | If Office 2003 is an option, you can just generate WordML files, which are XML that open fine in Word. But all the clients would need 2003 to make this feasible.
Sun 12 Oct | John Ridout | Have a look at this http://officewriter.softartisans.com/
Parameters to check at startup | Sun 12 Oct | Dave B.
What are some system parameters that you check when your program is loaded? I check to see if Windows was booted in Safe Mode, I check to make sure there is a mouse attached to the system, and I make sure only one instance of the program is running. Are there any other system (or other) parameters that you guys have learned (from experience) to check? I realize some things are program dependent, but I'm looking for OS parameters that give clues as to potential problems.
Sun 12 Oct | Chuck | I check the amount of free memory and the OS version in addition to the 3 parameters you mentioned.
Sun 12 Oct | Joel Spolsky | We always check the IE version to make sure it's 5.01 or later, both because we rely on some IE components and because upgrades to IE also included upgrades to various system components and common controls. Anything older than IE 5.01 and you're dealing with a computer that hasn't been updated in a long time which is likely to have all kinds of dumb issues.
Sun 12 Oct | x | Although this is program specific, we check the hard drive free space and the processor make and model. We also check to ensure that the program is started from a hard disk drive (i.e. not a removable drive). You could also check to ensure a keyboard is installed, but I don't believe IBM PCs will boot without one anyway, though you could still do so to check the type of keyboard installed if you need that information. Personally, I think it just makes sense to check most if not all of the things listed in this thread, especially for shrinkwrapped applications.
Sun 12 Oct | x | You may also check for a printer that is physically connected to the machine and maybe if the machine is connected to a network. (Though I don't have the code to do either.)
Sun 12 Oct | FireMode (I was kidding) | You can check that the computer power is on and that the user is alive. Here is the code (I put it in the public domain): int computerPowerIsOn() { return 1; } int userIsAlive() { int c; printf("Are you alive? (Y/N): "); c = getchar(); if (toupper(c) == 'Y') return 1; Call911("A zombie is hacking me"); return 0; }
Sun 12 Oct | Brad Wilson | 'You could also check to ensure a keyboard is installed, but i don't believe IBM PC's will boot without one anyway, though you could still do so to check the type of keyboard installed if you need that information.' Every PC I've had for, well, at least a decade has supported starting without a keyboard attached. Servers for sure very rarely have keyboards attached when they're booted, and keyboards (AT, PS/2, or USB) can be attached at any time and will function normally.
Sun 12 Oct | Sum Dum Gai | Why do you care whether a keyboard and/or mouse is attached? What possible difference could it make to your program? Especially given that you can emulate a mouse with the keyboard using functionality built into Windows, and I'm pretty sure you can do the opposite with 3rd party software.
Sun 12 Oct | somebody | I *hate* software that tries to check the system before running to make sure everything fits its definition of OK. I used to have software that wouldn't run if a printer wasn't attached even though printing wasn't something I'd ever want to do from this. My conclusion was that the programmers were too lazy to make sure their printing code didn't break when there was no printer. Since I had a laptop, this was a problem for me. Not surprisingly, that company is not in business today. Another example of 'helpful' system checks is the classic story of the NT check that was common in games when NT didn't support higher versions of DirectX. Typically, a game install would check the OS and prevent installation on NT. Then Microsoft came out with the next version of NT (Windows 2000) and it did support higher versions of DirectX and suddenly you couldn't play games on your system simply because so many game developers did such a stupid system check and offered no way of bypassing it. What good is something like checking for a mouse?
Sun 12 Oct | Dave B. | >> 'What good is something like checking for a mouse?' The 'mouse check' and most (but not all) of my other checks are simply made for support reasons. I have learned that you cannot trust what the customer is telling you (over the phone). They could be doing what you ask and telling you what they see on the screen or they could be playing solitaire and totally making things up. All of my checks are 'silent'. The program continues to function normally and the user is never informed that these checks are done. I would never halt the execution of the program because, for example, the printer or the mouse was not installed or attached to the system. This would obviously not make for a professional running system. I would however inform the user that they need to install a printer before trying to print using the program. In fact even if the user starts the computer in safe mode, the program will not complain. It will run as best it can. However if they call me for support and I see that their computer is running in safe mode, then I have an idea as to where to start troubleshooting.
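For anyone who wants to try the "silent" approach Dave B. describes, here is a minimal sketch in Python. It is not his code: the function names and the particular fields collected are assumptions chosen for illustration, and a Windows desktop app of that era would gather this through the Win32 API instead. The point is only that the checks are recorded for support and never stop the program.

```python
import platform
import shutil


def collect_startup_diagnostics(directory: str = ".") -> dict:
    """Gather environment facts quietly; nothing here blocks startup."""
    usage = shutil.disk_usage(directory)  # total, used, free (in bytes)
    return {
        "os": platform.system(),
        "os_version": platform.version(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "free_disk_mb": usage.free // (1024 * 1024),
    }


def format_for_support(diag: dict) -> str:
    """Render the diagnostics as text a support person can read back."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(diag.items()))


if __name__ == "__main__":
    # The user never sees this unless support asks for it.
    print(format_for_support(collect_startup_diagnostics()))
```

The report could be written to a log file or shown in an About box, so that support can read it instead of relying on what the customer says over the phone.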
A solution to spam? | Sun 12 Oct | r1ch
I've been wondering recently whether there really is a way that spam can be solved. I've seen many systems proposed and although none ever seem perfect, the common theme that does seem to have promise is to make it more expensive for the spammers. I saw an idea a while ago that there should be a stamp fee for sending email - if it was made small then it wouldn't be a significant impediment to your average user, but it might deter companies from sending spam as the costs would soon multiply up. Unfortunately, there are real downsides to this - what about all of those free email lists and newsletters that can be really useful - who would pay for them? Also, it wouldn't cut out spam completely - we all get junk mail at home even though stamps have to be paid for, right? So I'm wondering if some kind of deposit system could work. What if we had a system so that when we sent an email we entered into a kind of contract with the addressee, so that if the addressee decided that the email was unsolicited they could claim the deposit that the sender had left? That way, it would cost a fortune to send unsolicited email, but normal email wouldn't be affected. Obviously, the system would be open to abuse, but it's based on a contract so abusers would be punishable by courts. Maybe the idea could be extended to include whitelists - email could be accepted without the need for a deposit from recognised (and authenticated?) senders. Is this a stupid idea, or does anyone here think that it could work? Maybe even if the system itself could work, getting the infrastructure in place would make it unfeasible.
Sun 12 Oct | Dennis Atkins | Taxing email won't work because you'd only be able to tax email originating in the country where the legislation passes. The result would be that people in other countries would be able to send all the email they want at no cost, and all the spammers would move their servers offshore or just switch to distributing mail through trojaned servers (in which case you would be personally liable to pay for the spam originating from your compromised computer). Result: no change in spam, but you'd stop getting email from people you want to hear from.
Sun 12 Oct | JX | Yes, taxing e-mail will work. Why? Because the recipient can filter: if (the sender hasn't made a 10 cents deposit) AND (the sender isn't in the white list) then reject_email_without_even_showing_it_to_the_user
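JX's rule is short enough to write out. The sketch below is only an illustration of that one condition: the `Message` type, the field names and the idea that the deposit travels with the message are all assumptions, since nobody in the thread describes how a deposit would actually be verified.

```python
from dataclasses import dataclass

REQUIRED_DEPOSIT_CENTS = 10  # the figure from JX's post


@dataclass
class Message:
    sender: str
    deposit_cents: int = 0  # hypothetical: how much the sender put down


def should_reject(msg: Message, whitelist: set) -> bool:
    """Reject only when there is no deposit AND the sender is unknown."""
    no_deposit = msg.deposit_cents < REQUIRED_DEPOSIT_CENTS
    unknown_sender = msg.sender.lower() not in whitelist
    return no_deposit and unknown_sender


if __name__ == "__main__":
    friends = {"alice@example.com"}  # whitelist entries are stored lowercase
    print(should_reject(Message("alice@example.com"), friends))      # known sender
    print(should_reject(Message("bulk@example.net", 10), friends))   # paid deposit
    print(should_reject(Message("bulk@example.net"), friends))       # neither
```

Note that both conditions must hold before a message is dropped, so a whitelisted sender never needs a deposit.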
Sun 12 Oct | Li-fan Chen | * With popular and highly useful lists, it's possible that advertisers pay for them. A soft plug is all it takes. * Also, stamps can be made slightly cheaper based on reputation. It could be a sliding scale. In a database you could have a four-field table like this: Verisign-Key-Hex64, Company-Name-English, Campaign-Name-English, Rating. Various campaigns from various companies can then earn their rating. The better ones will always have a 7 out of 10 on average (the ratings will have to come from direct subscribers). The lower the rating, the closer the price gets to the price consumers pay. The higher the rating, the closer it gets to paying, say, 1% of that price. So if a vendor has two salesmen, and one sells in an honorable way, sending out emails that are basically 5% soft plugs and 95% useful, helpful information, then he'll probably earn a rating of 7 or higher. The other salesman sends out a campaign that's 50% hard sell and 50% useful information, and earns a lower rating of 3 out of 10. The idea is that the first salesman is associated with a campaign name, so that the second salesman's crap doesn't affect the receptivity of the first salesman. That way, if you have a few bad apples in a company, it won't ruin it for the entire company. For example, DoubleClick has many email marketers working with the DCLK dartmail system. Any one of these could be banned from all servers even though they were sending out permission-based, helpful emails, because they all share the same source IP in the dartmail email deployment system. But with the proposed changes, dartmail would assign a unique source ID to each email marketer, hoping to separate the good apples from the bad. For this to work it will have to be possible to build a digest from the Verisign-Key-Hex64, Company-Name-English, Campaign-Name-English fields, and have users vote on them.
When users vote they send packets like Verisign-Key-Hex64, Company-Name-English, Campaign-Name-English, Encrypted(UserID), Rating, and a central voting poll will forward a list of UserIDs to each of the companies (by looking up a webservices://www.Company-Name-English/Campaign-Name-English/userverify.asp web service, for example) and get back a valid-UserID check. When a UserID is determined to be valid, their votes will affect the outcome of the salesman's rating and the pricing of future campaigns. So the better your campaign seems to the consumers, the cheaper it gets. The crappier it is, the more expensive it gets. If you target your consumers properly, don't send crap to them if they don't ask for it, and send things they actually want to read, you do it cheap. Otherwise, sellers beware.
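The sliding scale could be sketched as below. The post only fixes the two ends (a poor rating pays close to full price, a top rating pays about 1% of it); the straight-line interpolation in between, the 0 to 10 range and the function name are assumptions made for the example.

```python
def stamp_price(base_price: float, rating: float,
                floor_fraction: float = 0.01, max_rating: float = 10.0) -> float:
    """Price per message for a campaign with the given subscriber rating.

    rating 0          -> the full base price
    rating max_rating -> floor_fraction of the base price (1% by default)
    Values in between are interpolated linearly (an assumption).
    """
    if not 0 <= rating <= max_rating:
        raise ValueError("rating must be between 0 and max_rating")
    fraction = 1.0 - (1.0 - floor_fraction) * (rating / max_rating)
    return base_price * fraction


if __name__ == "__main__":
    for rating in (3, 7, 10):
        print(rating, round(stamp_price(0.10, rating), 4))
```

With a linear scale the two salesmen in the example end up paying quite different rates, which is the incentive the post is after; a steeper curve would punish low ratings harder.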
Sun 12 Oct | pb | Cloudmark.
Sun 12 Oct | Adam Spitz | You might enjoy reading Paul Graham's website: http://www.paulgraham.com I've been using a spam filter based on his algorithm for a while now. I've had thousands of e-mails since I installed it, about half of them spam, and only four of the spam messages slipped through. There were zero false positives during the first couple of weeks; after that I started to trust the filter so much that I don't bother to check anymore. :)
any one can recommend a good selectable image back | Sun 12 Oct | leonard
My master hard drive is a 20 GB Seagate. I have a 4 GB W.D. removable drive. 15 GB on my master H.D. are all kinds of video movies, a big Internet cache and some non-important file folders. In case of a crash I can afford to give up those 15 GB. I want to make an image of the 5 important GB on my H.D. as a full restorable backup (including my Win98 operating system). For this I need software that can do a selectable image of my master H.D. from the Seagate to the removable W.D. drive. Can anyone recommend good selectable image backup software?
Sun 12 Oct | Frederic Faure | Used Norton Ghost and DriveImage. Both are good and easy to use.
CMS: Support for WYSIWYG editing? | Sat 11 Oct | Frederic Faure
Hi, As far as I know, there are two solutions to let users add content to a web site through a CMS: - The CityDesk way, which is WYSIWYG but requires generating the whole site before uploading it to a remote web server. Here, the site is static. - The server-side CMS, where users are expected to type raw text in a textarea. They must not prepare their contribution in a WYSIWYG HTML editor, as those do not generate raw content, but rather add the HTML stuff necessary for a single, stand-alone document... which we don't want if the article is to be delivered through the CMS. So... do you know of any good server-side CMS that offers a WYSIWYG interface to users, either as an enhanced web browser (HTMLEDIT box?) or a dedicated client app that sends content over the wire to the CMS through e.g. WebDAV or XML-RPC/SOAP? Thanks for any tips.
Sat 11 Oct | Philo | http://www.richtextbox.com is a .Net WYSIWYG HTML-editing component. Not sure if this helps. Philo
Sat 11 Oct | Frederic Faure | Thx Philo for the tip :-) Unfortunately, Richtextbox only runs with IIS + .Net. If I can help it, I'd rather use a non-proprietary solution. Do CMS makers _really expect_ users to type pages of text in a textarea? :-)
Sat 11 Oct | Philo | Non-proprietary? Or non-Microsoft? 'Nonproprietary' is really going to limit your options for CMS. Philo
Sat 11 Oct | Brad Wilson | There is a Flash-based WYSIWYG HTML editor that I've seen, used, and like. Unfortunately, the name eludes me, but I'm sure Google can give up the ghost.
Sat 11 Oct | www.marktaw.com | Download Radio and install it, it's got a JavaScript WYSIWYG editor that launches in your browser, IE only. So does eBay. It uses the same dhtmledit control that CityDesk does. You can call it from a web page. There are some others that use this DHTML Edit control. Actually, I'm not positive about Radio, it might be the pay version of Frontier. I played with it a bit, and was able to isolate the code that did it, you should be able to take that and learn how to do it yourself.
Sat 11 Oct | Brad Wilson | Radio does have it. RichTextBox (recommended above) is a .NET wrapper around it. The Flash based one is a little more universally accessible, if you can't dictate the browser.
Sun 12 Oct | www.marktaw.com | Yeah... that Flash version would have that benefit huh.
Sun 12 Oct | Herbert Sitz | Here's one that's available for free. It works only under IE, though: http://www.interactivetools.com/products/htmlarea/documentation.html?htmlarea#intro1
Sun 12 Oct | Frederic Faure | Thx for the tips :-) As for Radio, I tried a few times... but, besides the occasional GPF, I could never figure it out. It seems filled with features, and couldn't find a good tutorial to figure it out. Maybe I'll give it another try. Does anyone have experience with XML-RPC? Should I just forget about building a VB app that sends contents to a web server through this protocol?
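On the XML-RPC question: the protocol itself is small, so it may be less work than it sounds. As a rough illustration (in Python rather than VB, using the standard library's `xmlrpc.client`), this is what a request to post an article looks like on the wire. The method name `cms.postArticle` and its two parameters are hypothetical; a real CMS defines its own.

```python
import xmlrpc.client


def build_post_request(title: str, body_html: str) -> str:
    """Serialize a call to a hypothetical cms.postArticle(title, body)."""
    return xmlrpc.client.dumps((title, body_html), methodname="cms.postArticle")


if __name__ == "__main__":
    payload = build_post_request("Hello", "<p>Rich <b>text</b> body</p>")
    print(payload)  # the XML document that would be POSTed to the server

    # Decoding it again shows the markup survives the round trip.
    params, method = xmlrpc.client.loads(payload)
    print(method, params)
```

In a real client, `xmlrpc.client.ServerProxy` pointed at the CMS endpoint would build and send this payload for you. Any client able to issue an HTTP POST with a body like this can speak the protocol.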
Sun 12 Oct | Troy King | No CMS, but it's an excellent 100% script WYSIWYG client-side editor. It's also pretty inexpensive and is developer-based licensing, not server-based -- you can use it in as many projects as you like with just the one $70 license. I use it in several applications, and have found the source quite easy to modify (it's javascript). It requires IE pretty much like all the others, but has no server requirements. The samples use an ASP-based image browser, but you can replace that with a PHP version or you can run ASP on another platform, if supported by your host. I have rewritten the image browser myself to suit different purposes. It's also realistic to use it without the image browser.
Sun 12 Oct | Herbert Sitz | Frederic -- I don't think it uses XML-RPC, but are you talking about an application like this one?: http://www.powerblog.net/
Sun 12 Oct | Simon Lucy | {cough} try http://www.objective2k.com/AccessEdit
Sun 12 Oct | David Walker | For another WYSIWYG XHTML-compliant editing component with an appealing price (free) and an awful name, try fckeditor at http://www.fredck.com/FCKeditor/
Sun 12 Oct | Frederic Faure | Thx a bunch everyone :-) Don't know if TTW editing is good enough for anything longer than a few paragraphs, in which case I'll have to look into some dedicated app. Thx again for the tips.
Sun 12 Oct | fool for python | Mozilla/Firebird also has a good editor (Mozile). For a good web-based CMS check out Plone.
Sun 12 Oct | Frederic Faure | Thx. I know about Zope and Plone, but I was looking for a simple way to input rich text. Textarea just isn't the best word processor around :-) I guess Mozile is just a regular WYSIWYG HTML editor? In that case, we're back to square one: How to allow users to add contents through a CMS, ie. no hard-coded HTML.
Unicode and VB6 | Sat 11 Oct | Ube Jega
Thank you for the enlightening article on Unicode and Character Sets. Inspired by the article, I tried to copy and paste samples from many languages into my VB6 application's text boxes; the western European samples work. For all other languages, such as Hebrew and Russian, all I get is a pile of vertical bars. I wonder if a user using a localized Windows version would get the same result. I looked for some hints on this on the web and I found this link: http://support.microsoft.com/default.aspx?scid=http://support.microsoft.com:80/support/kb/articles/q193/5/40.asp&NoWebContent=1
Sat 11 Oct | Brad Wilson | What font are you using? Try "Arial Unicode MS".
Sat 11 Oct | Dave B. | The native Visual Basic Controls are not Unicode enabled. You need to use Unicode enabled controls. Microsoft provides the Microsoft Forms Controls for this purpose. The 'fm20.dll' is the DLL that contains these controls. Simply add a reference to it from the VB IDE. These controls are used by MS-Office/VBA and apparently are not distributable with your application. Microsoft says this: 'The Fm20.dll is NOT redistributable. You must have an application such as Microsoft Office 97 on the target system that installs Fm20.dll as part of its setup. (Fm20.dll is included with the OSR2 and OSR2.5 releases of Windows 95.) You can also find this file on the Visual Basic 5.0 CD under the \TOOLS\DataTool\Datatool\Msdesign folder. This will be installed only if you run the setup for the Visual Database Tools. In any case, you may not distribute the Fm20.dll as part of your setup, even if you purchase the Microsoft Office Developer Edition product.' Download this to get 'fm20.dll': http://msdn.microsoft.com/workshop/misc/cpad You may also simply type, 'unicode textbox control' in the MSDN search box. It brings up a lot of useful information.
Sat 11 Oct | Dave B. | I see you found it.  Didn't follow your link. Sorry.
Sat 11 Oct | ub | Thank you. Our application will still target our intended audience, but for the sake of completeness it would be nice to support all countries/languages. I'll try the Arial Unicode font. Regarding the DLL, it is interesting, but the redistribution is an issue; we can't expect our customers to have Office97 on each box.
Sat 11 Oct | Joel Spolsky | As far as I know Arial Unicode is also a part of MS Office and not redistributable. Probably the best way to get Unicode support on VB6 forms is to wrap the appropriate native operating system controls yourself in ActiveX controls. Otherwise this remains a very good reason to upgrade from VB6 to VB.Net.
Sun 12 Oct | Andy Norman | If you are dealing with Unicode in VB6, don't forget the excellent 'Internationalization With Visual Basic' by Michael S. Kaplan http://www.amazon.co.uk/exec/obidos/ASIN/0672319772/normancx-21
Magazines - How do you store them ? | Sat 11 Oct | Fairlight
How do you store your magazines? I have a bunch of mags and right now I do not know if I should keep them or throw them away. I just wonder if there might be some articles that I'll need to refer to in a couple of years. What do you do with your mags? Do you keep them or, let's say, throw them away every 3 years?
Sat 11 Oct | Mike | Throw them, you won't miss them.  Or at least I haven't.  Especially technology mags get outdated so fast.
Sat 11 Oct | . | I throw them away without even reading them.  Most of the time I don't even take the plastic off of 'em -- they go straight into the trash.  I don't even recycle them.
Sat 11 Oct | Troy King | In Adobe Acrobat. I find the articles I want, cut them out at the binding with a razor blade, and scan them into Acrobat. It only takes a few minutes per article, and I get to keep what I want with no physical storage space.
Sat 11 Oct | Mark Bessey | Many of my favorite tech magazines regularly release CD-ROM compilations of back issues. You'll most likely never remember which issue a particular article was in, anyway. Just put the magazines on a shelf for a year or two, then buy the latest compilation CD when it comes out. Optional next step: try really hard to find a library that's interested in your old magazines, then get frustrated and recycle them or sell them on eBay. For example, both Doctor Dobb's and Embedded Programming are available on CD at http://store.yahoo.com/ddjcdroms/ Actually, looks like DDJ has recently started a subscription + online access program where you can get access to the entire archive online with a subscription. I might need to look into that. -Mark
Sat 11 Oct | Darren Collins | Similar to Troy, I use my digital camera to photograph the pages I'd like to keep, then throw the magazine away or give it to someone else who might be interested. It's quicker than cutting and scanning.
Sat 11 Oct | Enlarge | Actually, I had to trash all my Wired, Embedded Systems,  Circuit Cellar, Scientific American and Technology Review magazines since the piles were collapsing on me. Also, the walk-in closet and bookshelf turned unusable.
Sat 11 Oct | Robert Jacobson | At the risk of stating the obvious, many magazines keep searchable archives of their old articles on their web sites. It's often easier to find an old article through a search engine than by hunting through stacks of old magazines.
Sun 12 Oct | Unfocused Focused | Digitally - I scan in my magazines using a sheet-feeding scanner - put them in one side, and then come back later and scan the other side of the stack. Then I THROW AWAY the original. I'm a packrat by nature, so I have to keep myself on track with this. I store them as images plus OCR'ed text on a DVD, and I'm still working on exactly the best way to organize them.
Sun 12 Oct | Troy King | It took me a whole night to scan in my first stack once I'd decided on that storage method, but now it's easy to keep up since they just trickle in monthly. That first pile took a while to get in, though.
Sun 12 Oct | Fairlight | For those who take pictures of their mags, this is actually a neat idea. I have a Sony DSC75; do you think that's good enough to take a readable picture of an article? Once you've taken a shot of the article, it becomes a .JPG file; ideally what I would love to do is perform a string search on those articles. Is there a way to transform a .JPG into a .PDF file or a .DOC file? I don't know how far software has gone in recognizing typeset text when it's embedded in a JPEG file.
Sun 12 Oct | Chris | Office XP and 2003 both come with an OCR program that should be able to OCR a JPEG file. Although it is a bit lacking in the accuracy department.
Sun 12 Oct | Bella | Yup, toss 'em. Now, it's all online. Pre-internet, I used to have piles and piles of 'em. I even made an Access database with all the topic keywords (optimization, etc.) for the main articles in each issue, so I could easily search for and refer back to an article... gosh darn, I can't help but laugh when I think about that. I was so into my programming career, it's not even funny.
Sun 12 Oct | Mickey Petersen | But make sure you keep a few of the old ones around. 10 years from now it'll be fun to see how their predictions were true/untrue and how tech's changed since then. It encourages nostalgia and makes you feel older.
Sun 12 Oct | Unfocused Focused | Well, I'm working my way back to the old ones - I've got a Compute! in the queue at some point. Anyone know of a demand for Apple IIe BASIC code listings? :-) (I still don't know if I'm going to have the heart to destroy my old Amiga RKMs - that IS memorabilia more than information.)
Sun 12 Oct | pb | Toss 'em. Those reporting that they scan or photo them have got to be kidding.
Sun 12 Oct | Troy King | pb - not kidding at all. I don't scan the whole magazine. I just scan the articles I want. It's usually just one or two articles per mag, and even cheap scanners (I think this one cost $120) scan pretty quickly in full-color mode. The resulting Acrobat files run 2 - 5M on average. They can also be OCRed automatically by Acrobat for content indexing, but I almost never go that far.
character encoding questions | Sat 11 Oct | BY
Hi, I've a few questions regarding the latest article on character encoding issues. 1. Does the concept of code pages apply to Unicode code points? Do different languages use the same code point for different characters, like 8-bit ASCII code pages? For example, 0xABCD might be one character in Japanese but another in Chinese? Has this happened? 2. I can understand simple encodings like UTF-8 and UCS-2, but what are those language encodings in IE (in the View->Encoding menu)? 3. How do font files fit into the picture?
Sat 11 Oct | Brad Wilson | 1. No. No. No. No. :) Code points are unique. Code pages are a completely different deal (they're different ways of slicing up the limited character space, pre-Unicode). 2. Those are code pages. 3. Some fonts are Unicode compliant (like Arial Unicode MS, I think it's called), but most are not. They are designed to be used with specific code pages. Unicode is much simpler than the code page mess.
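To make the difference concrete, here is a small Python sketch that decodes the same byte under three Windows code pages. The byte value 0xE0 and the choice of code pages are just picked for the example; the helper name is made up.

```python
import unicodedata


def decode_byte(value: int, code_pages=("cp1252", "cp1251", "cp1255")) -> dict:
    """Decode one byte under several code pages and return the characters."""
    raw = bytes([value])
    return {cp: raw.decode(cp) for cp in code_pages}


if __name__ == "__main__":
    # One byte, three different characters, each with its own code point.
    for cp, ch in decode_byte(0xE0).items():
        print(cp, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The byte is ambiguous until you know the code page, but each resulting character has exactly one Unicode code point, which is the uniqueness Brad is describing.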
Sat 11 Oct | Frederic Faure | By "compliant", I guess you mean that some fonts hold signs (don't know the correct term. "Rendering"?) for the entire set, while "non-compliant" means that this font only contains a sub-set, eg.  a Unicode font sold in Japan would only contain the English + Japanese signs, nothing more.
Sat 11 Oct | Brad Wilson | Yeah, I was simplifying some. There are very few fonts that contain glyphs for all the known Unicode code points (at least, whatever version of Unicode is supported by the NT-kernel Windows OSes).
Sat 11 Oct | Ori Berger | Code points are unique, but characters are not; some characters are repeated several times for completeness. Thus, it's possible to encode a string which would have the same visual representation (and essentially the same human interpretation) in more than one way. The Hebrew letter 'Aleph' (which looks like 'א', hopefully your browser and Joel's ASP script will collaborate to display this properly) is also a mathematical symbol used for infinite cardinalities. It has two codes - one in the math symbol area, and one in the Hebrew area. The Russian character 'C' is actually associated with the sound that 'S' makes in English, and Russian 'P' is associated with the sound that 'R' makes in English. Both 'C' and 'P' (and many others) are repeated in the Russian area of the Unicode set, even though there is no visual difference. Unicode also has several normalized forms and many de-normalized forms. For example, an 'o' with an umlaut (two small dots above it) can be represented as a precomposed character (having one code), and as a composition - umlaut (a code of its own) + o (a code of its own). This could lead to very subtle bugs - e.g., a user saves a file called 'CP' with both characters being from the Russian set; later, when she tries to open it, she can select it from Explorer, but can't open the file by typing its name. Furthermore, the sort order in Explorer will look wrong. Or a 'find' feature in an editor will look for the string that was typed, precomposed, and silently ignore the decomposed string, even though they are visually the same and have the same meaning. Proper support for Unicode is extremely hard - and it's not because of the spec, but rather because of the many details of the languages that need to be taken care of.
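The lookalike problem is easy to reproduce. A short Python sketch (the variable and function names are made up for the example) shows that the two 'CP' strings and the two alephs are different code points even though they render the same or nearly so:

```python
import unicodedata


def describe(text: str) -> list:
    """List each character's code point and official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


latin_cp = "CP"                # U+0043, U+0050
cyrillic_cp = "\u0421\u0420"   # look the same, different code points
hebrew_alef = "\u05D0"
math_alef = "\u2135"

if __name__ == "__main__":
    print(latin_cp == cyrillic_cp)   # False: a filename lookup would miss
    for text in (latin_cp, cyrillic_cp, hebrew_alef, math_alef):
        print(describe(text))
```

A plain string comparison treats these as unrelated text, which is exactly how the file-open and sort-order surprises above come about.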
Sun 12 Oct | Ankur | There are also sometimes many ways to get the same character - for example, Unicode mostly follows Latin-1 for the code points from 128-255, so there are a lot of legacy accented characters in that region. However, Unicode has a separate way of accenting *any* character by following the letter with a combining mark from the "Combining Diacritical Marks" section of the BMP. Any Unicode-compliant font is supposed to know that when the accent follows the letter, it's supposed to combine them into a single character. So, you can get á with U+00E1 or by using U+0061 followed by U+0301. The latter is more general, the former is currently more common. Your application has to treat both occurrences the same.
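One common way to "treat both occurrences the same" is to normalize strings before comparing them. This Python sketch uses the standard `unicodedata.normalize`; the helper name `same_text` is made up for the example.

```python
import unicodedata

precomposed = "\u00E1"        # á as a single code point
decomposed = "\u0061\u0301"   # a followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC (composed) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(precomposed == decomposed)           # raw comparison
    print(same_text(precomposed, decomposed))  # comparison after normalizing
    print(len(precomposed), len(decomposed))   # 1 code point vs 2
```

Normalizing to NFD on both sides would work as well; what matters is that both strings go through the same form before a comparison, search or sort.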
Story about fun with character sets | Sat 11 Oct | GP
I was working at a company where the main product was written in Visual FoxPro. The settings data stored in the database was encoded using a simple (and very crackable) encryption scheme. The scheme shifted the characters around according to a specific algorithm. The only problem is that the characters were converted into ASCII before doing the shift. So along I come and I have to write a web app using Java/JSP that needs to pull data from the database that is encrypted using this scheme. The key problem with porting the encoding algorithm is that, since the algorithm works with ASCII characters, they are going to wrap at the end if you shift the characters over enough, and so you have to mimic this in a language where Unicode is the base String implementation. Without getting into the details, I figured it out, but it goes to show that you really need to consider your character set even when you are not doing internationalization. As a little aside, you can imagine my surprise (or lack thereof) when we tried to make the FoxPro application work with a Unix version of Oracle and none of the encryption worked anymore.
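The wrap-around GP mentions can be shown in a few lines. This is not the FoxPro algorithm (the post does not give it); it is a sketch that assumes a plain "add N to every character" shift on single-byte characters, with made-up function names, to show why a naive port to Unicode strings gives different output.

```python
def shift_bytes(data: bytes, shift: int) -> bytes:
    """Shift each single-byte character, wrapping around at 256."""
    return bytes((b + shift) % 256 for b in data)


def shift_code_points(text: str, shift: int) -> str:
    """Naive port to Unicode strings: no wrap, values just keep growing."""
    return "".join(chr(ord(ch) + shift) for ch in text)


if __name__ == "__main__":
    print(shift_bytes(b"z", 200))        # 122 + 200 = 322, wraps to 66
    print(shift_code_points("z", 200))   # 322 stays 322, i.e. U+0142
    # Undoing the byte shift recovers the original data.
    print(shift_bytes(shift_bytes(b"settings", 200), -200))
```

To reproduce the original behaviour in a Unicode-based language, the port has to apply the same modulo explicitly and then agree on a single-byte encoding when reading from and writing to the database.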
Sun 12 Oct | JX | Languages such as Java and C# should implement 2 types of strings: ASCII strings and Unicode strings. People speaking other languages should also switch to English, and in time, all the people should speak only English. I am saying this as a non-native English speaker. The advantage of having all the people all over the world speaking only one language is far greater than the disadvantages.
Sun 12 Oct | runtime | and all those Mac and Linux users should switch to Windows. The advantage of having all the people all over the world using only one operating system is far greater than the disadvantages.
The Language Formerly known as Lingo... | Sat 11 Oct | Bill Rayer
is now known as: - Visual Fred - Cola - Ubercola - Ubercode - Ubersharp Any preferences? I was thinking of Visual Fred - the language that is almost but not entirely unlike Visual Basic.
Sat 11 Oct | John Topley (www.johntopley.com) | My preference is for Cola if you can get away with it. I wouldn't be able to take anything called Visual Fred seriously.
Sat 11 Oct | Bill Rayer | If I use 'cola', I may still get the same background radiation from the loyal defenders of other companies' trademarks: 'did you know Cola Cola is a trademark' etc.
Sat 11 Oct | Eric Debois | Cola should be safe. No risk of confusion. And there are many brands that use the word Cola. Calling your language Pepsi Cola wouldn't work though :D
Sat 11 Oct | no name | Is there a market for this language of yours, Bill?  Or is this just a hobby you are pursuing?
Sat 11 Oct | Almost Anonymous | Cola is a great name for a programming language...  I just wish I had thought of it first! 
Sat 11 Oct | Philo | The problem with 'Lingo' was that there is an existing programming language with the exact same name. 'Cola' is generic for caramel-colored soda pop, and as far as I know there's nothing in the IT realm with the name. Philo
Sat 11 Oct | Stephen Jones | I prefer Visual Cola
Sat 11 Oct | Philo | I second "Visual Cola"
Sat 11 Oct | FireMode | The "Visual" prefix will be old-fashioned in 1 or 2 years. Why use it?
Sat 11 Oct | sgf | Visual is bogus anyway. Is there *any* language in which you can *really* program visually, i.e. drag and drop icons, no text involved? If there were, would you consider it a language, or using it to be programming? Just a random rant against another buzzword......
Sat 11 Oct | Philo | 'Object Cola'? What's the next big thing? Anyone got any guesses? Philo
Sat 11 Oct | Brad Wilson | Cola.NET, of course!
Sat 11 Oct | Dennis Atkins | Visual comes after text based. So what after Visual? Psychic! Introducing 'Psychik Cola -- the Language that Knows what you want to Code before you do! (tm)' 'Where will you go today? Psychick Cola knows the answer! (tm)' -- So, Bill... you, er... heard from Macromedia or what?
Sat 11 Oct | Dennis Atkins | And the oracle of google informs us that there is already a computer programming language called CoLa: http://citeseer.nj.nec.com/hirsbrunner94cola.html This could be tricky. No psychic cola though!
Sat 11 Oct | Dennis Atkins | Oh dear, and here is a second programming language (I think) called cola: http://cvs.perl.org/cgi/cvsweb.cgi/parrot/languages/cola/ If no Cola. Maybe Whopper? Or Fries? Shake? Earthquake? Dingo? Kangaroo?
Sat 11 Oct | Philo | How about Wide Area Language for Multitasking Applications in Real Time ? Surely that's available? Philo
Sat 11 Oct | Can't compete with MS | Don't let it get you down. The product I'm working on was originally supposed to be called Data Analyzer. Guess who took that name? Our backup name was Infopath - guess who also took that name?
Sun 12 Oct | Simon Lucy | Code Cola
Sun 12 Oct | Steven Pinker | How about 'Jargon'?  Or maybe 'deep structures in the universal grammar parse tree'?
Sun 12 Oct | John Topley (www.johntopley.com) | Jargon's quite a good name but then I thought..."Jargo"!
Sun 12 Oct | Bleh | sgf: Have you seen the tools that come with mindstorm? You basically link together coloured blocks, some of which can take values.. It's very neat (:
Sun 12 Oct | FireMode | Raylang (RAYer LANGuage): some narcissism won't hurt!
Step #231 in the spam escalation wars | Sat 11 Oct | Simon Lucy
I use a Bayesian filter for all my email to tag spam and divert it from my usual attention, and on the whole it is working well. In the last few days I've noticed a type of email getting through that may be difficult to filter. It's a multipart MIME mail and the plain text alternate is: "never as yet published in full, only abstracted in the Origin. despotic elements retained by the conquered nations as yet only" The HTML component, which I'd never normally see as I only read plain text mail, had an advert for selling drugs online and at the bottom was: "when occupying Yang-p`ing and about to be attacked by Ssu-ma I, I could hardly repress a shuddering recoil as he came, bending amiably, these differences implied in itself a political classification. A" So it looks like they're trying to poison the statistical filter by increasing the use of contextually irrelevant but general words, so that the statistical score is likely to be similar to acceptable email. This one fails to a degree because the subject line is mangled and obviously spam: "Che_ck out ou-r se,lection (of gre=at RX mp_xsdjjd" So now I'm thinking of counter methods, since others will generate mail with that other filter-avoidance technique: the entirely irrelevant but superficially reasonable subject line. I still don't buy the boil-the-ocean solution of replacing SMTP, since it's not the delivery method but the content that's poisoned. Instead, I suppose the same model as that used in the Cold War will be applied: small incremental improvements on both sides to counter the previous improvement, falling back occasionally to more primitive methods. I can certainly look at tailoring the filtering so that it treats different components separately, scoring both the plain text and the HTML and choosing a particular bucket when the scores differ. Or applying a vocabulary checker to see if the same words are used in both components.
The latter would work for me, since if the plain text just says "read the HTML" I'll ignore it anyway.
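The vocabulary-checker idea could be sketched roughly like this in Python, using only the standard library. Everything here is an assumption for illustration: the function names, the crude word tokenizer, and the choice of Jaccard overlap as the score. `make_message` exists only to build a demo message.

```python
import re
from email import message_from_string, policy
from email.message import EmailMessage
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def words(text):
    """Crude tokenizer: lower-cased runs of ASCII letters and apostrophes."""
    return set(re.findall(r"[a-z']+", text.lower()))

def part_overlap(raw_message):
    """Return the Jaccard overlap (0.0 to 1.0) between the vocabulary of the
    plain text and HTML alternatives, or None if either part is missing."""
    msg = message_from_string(raw_message, policy=policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    html = msg.get_body(preferencelist=("html",))
    if plain is None or html is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(html.get_content())
    plain_words = words(plain.get_content())
    html_words = words(" ".join(extractor.chunks))
    union = plain_words | html_words
    if not union:
        return 1.0
    return len(plain_words & html_words) / len(union)

def make_message(plain, html):
    """Demo helper: build a multipart/alternative message as a string."""
    msg = EmailMessage()
    msg.set_content(plain)
    msg.add_alternative(html, subtype="html")
    return msg.as_string()

spam = make_message("never as yet published in full",
                    "<html><body><p>Buy cheap drugs online</p></body></html>")
print(part_overlap(spam))  # 0.0: the two parts share no words
```

A low overlap would then be one more signal to feed into the existing score rather than a verdict on its own; where to put the threshold is a tuning question this sketch does not answer.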
Sat 11 Oct | Brad Wilson | One word: SpamAssassin.
Sat 11 Oct | Brad Wilson | Oh, okay, that was just dramatic. :) If you're a Windows user, and your e-mail comes in via POP3, then SAproxy is what you want. Fully enclosed copy of SpamAssassin that masquerades as a POP3 proxy. I was using it while my mail host was fixing their broken SpamAssassin install. Brilliant, it is. :)
Sat 11 Oct | Simon Lucy | Well yes, though it's pretty much equivalent to what I already use, combining Bayesian filters and header analysis.  So it's going to have to handle exactly the same thing.
Sat 11 Oct | Hardware Guy | My ISP offers SpamAssassin, and it *is* very good.  But the nonsense-text spams mentioned above have been leaking through to an alarming degree over the last week or two.
Sat 11 Oct | Philo | By the way, has everyone else's (in the US) phone stopped ringing? I don't think I've heard from a telemarketer all week. [pleasant sigh] Philo
Sun 12 Oct | no name | It got to the stage at work where I would physically disconnect the phone because it was ringing so much. It was only fax machines as well *sigh*  Recently it got better, so now I just pick up, it bleeps, and then I lay the headset on the desk. It must cost idiots who can't work fax machines a shed load of money.
Sun 12 Oct | no name | This has been about for a while on usenet, Simon. Step 232 is something which follows that thing about humans being able to read words even when they are misspelt as long as the beginning and end is correct. Not that it helps when I spamcop them, but I'm sure it makes them feel better, even if it makes them appear illiterate.
Sun 12 Oct | Bill Godfrey | 'The HTML component' Your scanner continued after this point? I kill on the merest sign of HTML in an email. Maybe I'm just strange.
Sun 12 Oct | Chris Nahr | Maybe you don't communicate with very many people.  Lots of normal people, i.e. those who aren't programmers, use HTML email as a matter of course...
Sun 12 Oct | as | I don't think Bill's strange, but maybe automatic deletion is a bit extreme. I filter all HTML mail into a separate mailbox, which I find makes it easier to spot the very rare good messages. It's true that a lot of people do send HTML mails, but often because it's turned on by default, not because they use the formatting features, and everyone I know has been quite happy to turn HTML off when asked.
PHP, Unicode .. other solutions! | Sat 11 Oct | Jonas B.
PHP has other problems than Unicode. There are very common security issues with many of the PHP scripts you can download everywhere. Mostly these come down to either not sanitizing data or failing to initialize variables properly. Plus it scales to ugly beasts of programs. Just look at whatever you find on SourceForge. They say PHP5 will solve all this. To me it's just vaporware, and no one knows yet if it'll improve things. There are plenty of other good web solutions. Perl was the web king before PHP and ASP, and it still offers some tremendous improvements over PHP. It'll run your scripts much faster (which translates to more simultaneous users), it has a taint mode for data, which is a great help, and several mature web templating systems. Plus it handles Unicode well. Please check out Mason(hq.com) for very straightforward templating, rather PHP-like. For something with a slightly steeper learning curve but which scales very beautifully for larger codebases, look at Axkit(.org). PHP was intended for Personal HomePages, and it'll take a couple of years yet before it matures beyond that architecturally. I'm sure it will, however, since Yahoo puts resources into it. But for now, I'll stick to Perl.
Sat 11 Oct | Tom (a programmer) | Am I the only person who read Joel's message, 'When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.' and thought, 'Wow, Joel's going to fix PHP!'?
Sat 11 Oct | Simon Lucy | Ummm, very likely you were.
Sat 11 Oct | Brad Wilson | Especially since ASP.NET is built on .NET, whose strings are all internally UTF-16. :)
Sat 11 Oct | JX | I am a non-English speaker. Frankly, I believe that everybody should switch to English. Unicode is a bloody horror! :-(
Sat 11 Oct | Brad Wilson | Code pages are a bloody horror. Unicode is tolerable.
Sun 12 Oct | Damien Connolly | Separate your content from your HTML! Store your content somewhere as ( UTF-8 / UTF-16 / Favoured Encoding ) and then *translate* it to whatever ugly glob of gunk it needs to be for ( Browser-X ) to see it. HTML is just a format for rendering out stuff to a bunch of incompatible, inconsistent rendering engines. I am sick of people writing it, talking about it, and viewing it as some kind of language! Imagine if discussions went on like this about EPS or .TEX files. Let the machines make the HTML and let's just manage the content, because the content is really what matters.
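As a rough illustration of "store in one encoding, translate on the way out", here is a minimal Python sketch. The function name `render_for` is hypothetical; it assumes the content is stored as UTF-8 bytes and that characters the target encoding cannot represent should become numeric character references.

```python
def render_for(content_utf8: bytes, target_encoding: str) -> bytes:
    """Decode stored UTF-8 content and re-encode it for a given output charset.

    Characters missing from the target charset are emitted as numeric
    character references (e.g. &#8364;), which HTML renderers understand.
    """
    text = content_utf8.decode("utf-8")
    return text.encode(target_encoding, errors="xmlcharrefreplace")

stored = "café €".encode("utf-8")
print(render_for(stored, "latin-1"))  # b'caf\xe9 &#8364;'
print(render_for(stored, "ascii"))    # b'caf&#233; &#8364;'
```

This only covers the character set conversion. Escaping markup characters such as `<` and `&` is a separate step that a real templating layer would also need.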
Sun 12 Oct | Alex | Frankly, I'm not starting any new work using anything but UTF-8. It just makes the whole mess so much easier to handle. For strings, for example, sanitizing based on regular expressions and POSIX character classes is a snap. [:alpha:] expands to letters whether it's A, ß or ç. And everything else just sort of works, so long as all your functions/objects are UTF-8 aware. By the way, for those stuck with a lot of legacy PHP code (like me ;-() the mbstring module allows you to overload a lot of the text handling functions, from regular expressions to strlen(). You still have to deal with nasty bits of code, but you won't get screwed by a forgotten strlen() not converted to a multi-byte aware version.
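The same two points (letter classes that understand non-ASCII letters, and length functions that count characters rather than bytes) can be shown with a small Python sketch. This is an analogue, not PHP or mbstring: `str.isalpha()` stands in for a Unicode-aware `[:alpha:]`, and the function name `keep_letters` is made up for illustration.

```python
def keep_letters(text: str) -> str:
    """Sanitize a string by keeping only characters Unicode classifies as letters."""
    return "".join(ch for ch in text if ch.isalpha())

print(keep_letters("Aß-ç 42!"))  # Aßç

# The "forgotten strlen()" problem: byte count and character count diverge
# as soon as a character needs more than one byte in UTF-8.
word = "Straße"
print(len(word))                  # 6 characters
print(len(word.encode("utf-8")))  # 7 bytes, because ß takes two
```

Any code that mixes the two counts (say, truncating by byte length a string measured in characters) is where multi-byte text tends to get cut in the middle of a character.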
Headers and charsets | Sat 11 Oct | Philippe
Great article on charsets etc. But why is it that the article has Content-Type *after* things like the title? It is not the case in the first Brazilian article, as you can see. Was there a change between 2.0.18 and 2.0.20? Or isn't the whole site generated from the same file? Joel on Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)