Webword Statistics - What the elves did in the night
Dear John,

I have been reading (and recently posting) on webword for a few years now, and have visited the site a lot more since you started allowing people to easily add comments of their own.

WebWord is now one of very few sites that I visit regularly (over 3 times a week) and spend a couple of hours a week in the site.

However over the past few weeks I have found it harder and harder to 'keep up' as the number of items and comments keep increasing.

It's not too difficult to keep up with your links, as I can just work my way through the home page (although its a bit more difficult if I've been off-line for more than a few days).

The thing is this: After I have read something you have linked to or written, I like to read comments that have been added since I last looked at the site. I found myself trying to memorise the number of comments on each item, but that soon confused me.

I have spent a few days this week writing a program for my personal use, that gives me a list of items and comments that have been published since I last looked at the site. I then use these generated links to jump straight into the page for each item, rather than haphazardly browsing the home page. I am still accessing the site through the homepage, but less often than I was.

After getting this basic piece of functionality working, I decided to try and see if I could produce some interesting stats from the data I had in my database.

To that effect I have produced a few reports that give monthly breakdowns of figures for the number of items published, the number of comments posted, the number of known users posting messages. As well as the top posters for each month and the ten most commented on items for each month.

Chris McEvoy

28 June 2002

How it works
In simple terms I am taking a snapshot of the Webword site and then processing the data to produce my list of links and statistics HTML pages.

When I run my VB application  it does the following :
  • Find the ID of the last LogItem  that was checked by the program.
  • Take 50 away from this number, so that it is checking the 50 most recent items, as well as any new ones that it finds.
  • It loads the item by loading the browser using the generated URL and then saving the HTML into a local file.
  • It knows when it has finished when it finds it cannot load 5 items in a row. (There are a couple of gaps in the IDs, so I though 5 would be a safe check)
  • If the LogItem is already in the databse, then it will check if the length of the item has changed. If it has, it has been updated, if not then it's still the same.
  • New and updated items are marked in tblWeblogs that they need processing.
  • When it runs out of items, it starts the deconstruction of the local HTML data.
  • The HTML LogItems are broken down into their constituent parts, by searching for specific tags or text in the data. Some aspects of the HTML have changed since April so it needs to check for a few different situations.
  • The HTML is broken down into the Date Posted, Title, URL, Description and John's comment on the item.
  • It then looks at the comments posted by the members, and breaks each comment into Date/Time Posted, Comment Text and the User Id.
  • As people can change the name and URL that they use for posting, I have to do some extra work to try and identify users, part of this is automatic and the final check is manual. For example Jack Schonchin has used 52 different versions of his name and URL (so far). 
  • Once, all of the new or updated items have been broken down, the data is ready to be used.
  • The diagram below, shows the table structure used to model the system. The lines represent the relationships between the tables.


How the data is used
I can query the database to find the comments that have been added since I last looked at the system:

SELECT tblComments.When, tblUniqueUsers.Name, tblLogDetail.LogTitle, tblComments.Comment, tblComments.LogID, tblUniqueUsers.UniqueID
FROM tblUniqueUsers INNER JOIN (tblUsers INNER JOIN (tblComments INNER JOIN tblLogDetail ON tblComments.LogID = tblLogDetail.LogID) ON tblUsers.UserID = tblComments.UserID) ON tblUniqueUsers.UniqueID = tblUsers.UniqueID
ORDER BY tblComments.When DESC



Using this data I can generate a list of links to new items and new comments that I have not yet seen.

The application also generates a set of static HTMl files that contain useful and interesting statistics about items and comments.