Category management Part 4: Learning how to read

Dec 29, 2006

So now we have a two tables, one full of category names and ids and one which relates those ids to left and right values which supposedly match up to positions on a tree. This is generally the part where most people start to cry about difficult math and SQL queries... but in Celko we trust my children so let's go over the queries we'll be using.

Now, before I get started, I would like to take a moment to pitch Celko's book, "Trees and Hierarchies In SQL for Smarties" again. Those who have read the book may point outy that Celko mentions MySQL as a poor choice of database. This was true when he wrote it. However, everything he wrote can now be entered directly into MySQL 5. Most of the queries below are slightly modified versions of queries in that book with the goal of applying them to a category management system. Problem is that there is a wealth of information in there that has nothing to do with the topic of these articles - so go pick it up and read it.

No really... it's the same information

Normalization allows you to eliminate repetitive information and create wickedly complicated table structures that take extremely complicated queries to read in. This was a problem before MySQL 5 - but now we have views, The first thing we need to do is set up a quick view so that we can see tree with the category names for each node.

CREATE VIEW site AS (
SELECT * from tree JOIN category USING (cate_id) ORDER BY tree_left
);

I ordered the results by tree_left to make life easier while debugging. Now you'll be able to see the order your worm will crawl as it moves through the table. Incidentally it might be worth your time to check your table to make sure the left and right values are correct before continuing. Go ahead and select * on your view and make sure your left and right values match up the way they should. If you were using my test data, you should see something like this:

+---------+---------+-----------+------------+---------------+
| cate_id | tree_id | tree_left | tree_right | cate_name     |
+---------+---------+-----------+------------+---------------+
|       1 |       2 |         1 |         54 | Home          | 
|       2 |       1 |         2 |         35 | North America | 
|       3 |       3 |         3 |         10 | Mammal        | 
|       4 |       5 |         4 |          5 | Fox           | 
|       5 |       6 |         6 |          7 | Deer          | 
|       6 |       7 |         8 |          9 | Buffalo       | 
|       7 |       8 |        11 |         22 | Fish          | 
|       8 |      10 |        12 |         17 | Salt Water    | 
|       9 |      12 |        13 |         14 | Sea Bass      | 
|      10 |      13 |        15 |         16 | Shark         | 
|      11 |      15 |        18 |         21 | Fresh Water   | 
|      12 |      17 |        19 |         20 | Trout         | 
|      13 |      18 |        23 |         34 | Insect        | 
|      14 |      20 |        24 |         33 | Winged        | 
|      15 |      21 |        25 |         30 | Bee           | 
|      16 |      22 |        26 |         27 | Honey         | 
|      17 |      23 |        28 |         29 | Killer        | 
|      18 |      24 |        31 |         32 | Fly           | 
|      20 |      26 |        36 |         53 | Africa        | 
|       3 |       4 |        37 |         40 | Mammal        | 
|      19 |      25 |        38 |         39 | Big Five      | 
|       7 |       9 |        41 |         50 | Fish          | 
|       8 |      11 |        42 |         47 | Salt Water    | 
|      21 |      27 |        43 |         44 | Clown Fish    | 
|      10 |      14 |        45 |         46 | Shark         | 
|      11 |      16 |        48 |         49 | Fresh Water   | 
|      13 |      19 |        51 |         52 | Insect        | 
+---------+---------+-----------+------------+---------------+
27 rows in set (0.05 sec)

Right off the bat now we have the ability to run numerous queries which will allow us easy access to the information contained in our tree. Every time we add a child category to a parent category, the right value will increase by 2.. As such the difference between the left and right value of a given node will tell us how many children there are. We can also find all the child nodes in the table by finding all the nodes where the left value is one less than the right.

SELECT tree_id, cate_id, cate_name FROM site WHERE tree_left = (tree_right - 1);

+---------+---------+-------------+
| tree_id | cate_id | cate_name   |
+---------+---------+-------------+
|       5 |       4 | Fox         | 
|       6 |       5 | Deer        | 
|       7 |       6 | Buffalo     | 
|      12 |       9 | Sea Bass    | 
|      13 |      10 | Shark       | 
|      17 |      12 | Trout       | 
|      22 |      16 | Honey       | 
|      23 |      17 | Killer      | 
|      24 |      18 | Fly         | 
|      25 |      19 | Big Five    | 
|      27 |      21 | Clown Fish  | 
|      14 |      10 | Shark       | 
|      16 |      11 | Fresh Water | 
|      19 |      13 | Insect      | 
+---------+---------+-------------+
14 rows in set (0.04 sec)

This gives us the ability to find all the nodes underneath a parent very quickly by finding nodes with a left value higher than, and a right value lower than a given node. In our case, this gives us two options - either searching by a single position in the tree (by tree_id):

SELECT c.tree_id as ctree_id, c.cate_id AS ccate_id, c.cate_name AS ccate_name
FROM site AS p, site as c
WHERE c.tree_left > p.tree_left
AND c.tree_right < p.tree_right
AND p.tree_id = '9'
ORDER BY p.tree_left, c.tree_left;

+----------+----------+-------------+
| ctree_id | ccate_id | ccate_name  |
+----------+----------+-------------+
|       11 |        8 | Salt Water  | 
|       27 |       21 | Clown Fish  | 
|       14 |       10 | Shark       | 
|       16 |       11 | Fresh Water | 
+----------+----------+-------------+
4 rows in set (0.07 sec)

or by all the appearances of a category in the tree (by cate_id). In this case we need to add a group by cate_id because if the parent category appears more than once in the tree, then the child category may as well.

SELECT c.tree_id as ctree_id, c.cate_id AS ccate_id, c.cate_name AS ccate_name
FROM site AS p, site as c
WHERE c.tree_left > p.tree_left
AND c.tree_right < p.tree_right
AND p.cate_id = '7'
GROUP BY c.cate_id
ORDER BY p.tree_left, c.tree_left;

+----------+----------+-------------+
| ctree_id | ccate_id | ccate_name  |
+----------+----------+-------------+
|       10 |        8 | Salt Water  | 
|       12 |        9 | Sea Bass    | 
|       13 |       10 | Shark       | 
|       15 |       11 | Fresh Water | 
|       17 |       12 | Trout       | 
|       27 |       21 | Clown Fish  | 
+----------+----------+-------------+
6 rows in set (0.02 sec)

Thinking inside the box

Each node in the tree is now acting like a box. The wider the box (difference between left and right values) the more little boxes it can contain. The interesting thing about this is that the left values of the table will always fall between the left and right values of the parent nodes. As such, we can also quickly find positional information for each node in the tree.

One problem with using the adjacency list method for a hierarchy is it is very expensive to find out how high on the tree a given node is. Without that knowledge, you often have to set a limit to how deep your pages go. The box property of MPTT makes this trivial. All we need is to search for all nodes where left value of your given node is between the left and right values. The following query presumes home should be the zero level (useful if you're using the result of this in an array.)

SELECT thisId.tree_id, thisId.cate_id, thisId.cate_name, (COUNT(quan.tree_id) - 1) AS tree_level
FROM site AS quan, site AS thisId
WHERE thisId.tree_left BETWEEN quan.tree_left AND quan.tree_right
GROUP BY thisId.tree_id;

+---------+---------+---------------+------------+
| tree_id | cate_id | cate_name     | tree_level |
+---------+---------+---------------+------------+
|       1 |       2 | North America |          1 | 
|       2 |       1 | Home          |          0 | 
|       3 |       3 | Mammal        |          2 | 
|       4 |       3 | Mammal        |          2 | 
|       5 |       4 | Fox           |          3 | 
|       6 |       5 | Deer          |          3 | 
|       7 |       6 | Buffalo       |          3 | 
|       8 |       7 | Fish          |          2 | 
|       9 |       7 | Fish          |          2 | 
|      10 |       8 | Salt Water    |          3 | 
|      11 |       8 | Salt Water    |          3 | 
|      12 |       9 | Sea Bass      |          4 | 
|      13 |      10 | Shark         |          4 | 
|      14 |      10 | Shark         |          4 | 
|      15 |      11 | Fresh Water   |          3 | 
|      16 |      11 | Fresh Water   |          3 | 
|      17 |      12 | Trout         |          4 | 
|      18 |      13 | Insect        |          2 | 
|      19 |      13 | Insect        |          2 | 
|      20 |      14 | Winged        |          3 | 
|      21 |      15 | Bee           |          4 | 
|      22 |      16 | Honey         |          5 | 
|      23 |      17 | Killer        |          5 | 
|      24 |      18 | Fly           |          4 | 
|      25 |      19 | Big Five      |          3 | 
|      26 |      20 | Africa        |          1 | 
|      27 |      21 | Clown Fish    |          4 | 
+---------+---------+---------------+------------+
27 rows in set (0.02 sec)

This query can also give us the height of the tree.

SELECT MAX(tree_level) AS tree_height
FROM (SELECT thisId.tree_id, thisId.cate_id, thisId.cate_name, (COUNT(quan.tree_id) - 1) AS tree_level
FROM site AS quan, site AS thisId
WHERE thisId.tree_left BETWEEN quan.tree_left AND quan.tree_right
GROUP BY thisId.tree_id) AS mcelaney;

+-------------+
| tree_height |
+-------------+
|           6 | 
+-------------+
1 row in set (0.01 sec)

Now one of the first things I showed you was that we could find the children of a node... but the box model can be used to find the immediate children. The query here checks to see if the child node is in the sub query - if so then it skips it. To get the results for an individual node, add WHERE tree_id = 'x'. Keep in mind that we could also run this query against the cate_id to get the direct children of that category across the entire tree.

SELECT parent.tree_id, parent.cate_name, parent.tree_left,
child.tree_id, child.cate_name, child.tree_left, child.tree_right
FROM site AS parent, site AS child
WHERE child.tree_left BETWEEN parent.tree_left AND parent.tree_right
AND child.tree_id != parent.tree_id
AND parent.tree_id = '1'
AND NOT EXISTS (
SELECT *
FROM site AS middle
WHERE middle.tree_left BETWEEN parent.tree_left AND parent.tree_right
AND child.tree_left BETWEEN middle.tree_left AND middle.tree_right
AND middle.tree_id NOT IN(child.tree_id, parent.tree_id))
ORDER BY parent.tree_left;

+---------+---------------+---------+-----------+-----------+------------+
| tree_id | cate_name     | tree_id | cate_name | tree_left | tree_right |
+---------+---------------+---------+-----------+-----------+------------+
|       1 | North America |       3 | Mammal    |         3 |         10 | 
|       1 | North America |       8 | Fish      |        11 |         22 | 
|       1 | North America |      18 | Insect    |        23 |         34 | 
+---------+---------------+---------+-----------+-----------+------------+
3 rows in set (0.03 sec)

If we turn this query into a view (hereafter referred to as direct_child... don't forget to pull out the AND parent.tree_id = '1') we can run a simply query against it to gain the direct parent of any node (or parents of any category).

SELECT ptree_id, pcate_id, pcate_name, MIN(tree_left) - 1, MAX(tree_right) + 1
FROM direct_child
WHERE tree_id = '5'
GROUP BY ptree_id;

So we can get the children, the direct children and the parents - and can create views to accomplish all these feats. As it turns out the hard one is to get all the siblings of a given node. We need a view to give the level of each node in the tree. Should look like the last except we're not going to subtract one from the level:

CREATE VIEW siblings_level AS
SELECT COUNT(A2.tree_left) AS lvl, A1.tree_id, A1.cate_id, A1.cate_name, A1.tree_left, A1.tree_right
FROM site AS A1, site AS A2
WHERE A1.tree_left BETWEEN A2.tree_left AND A2.tree_right
GROUP BY A1.cate_name, A1.tree_left, A1.tree_right;

Then we can use the following query to snag the siblings:

SELECT DISTINCT S2.cate_name
FROM siblings_level AS S1, siblings_level AS S2
WHERE S1.tree_id = '8'
AND EXISTS
(SELECT *
FROM siblings_level AS S0
WHERE S1.tree_left BETWEEN S0.tree_left AND S0.tree_right
AND S2.tree_left BETWEEN S0.tree_left AND S0.tree_right
AND S0.lvl = S1.lvl - 1
AND S1.lvl = S2.lvl);

+-----------+
| cate_name |
+-----------+
| Fish      | 
| Insect    | 
| Mammal    | 
+-----------+
3 rows in set (0.04 sec)

Last query we have seems twisted, but it solves one of the most common recursive operations on websites - the breadcrumbs.

SELECT A2.tree_id, A2.cate_name,
(SELECT COUNT(*)
FROM site AS A4
WHERE A4.tree_left BETWEEN A1.tree_left and A1.tree_right
AND A2.tree_left BETWEEN A4.tree_left AND A4.tree_right) AS path_nbr
FROM site AS A1, site AS A2, site AS A3
WHERE A1.tree_id = '2' /*starting node*/
AND A3.tree_id = '23' /*ending node*/
AND A2.tree_left BETWEEN A1.tree_left AND A1.tree_right
AND A3.tree_left BETWEEN A2.tree_left AND A2.tree_right;

+---------+---------------+----------+
| tree_id | cate_name     | path_nbr |
+---------+---------------+----------+
|       2 | Home          |        1 | 
|       1 | North America |        2 | 
|      18 | Insect        |        3 | 
|      20 | Winged        |        4 | 
|      21 | Bee           |        5 | 
|      23 | Killer        |        6 | 
+---------+---------------+----------+
6 rows in set (0.10 sec)

Take a breather...

Granted, I know a lot of these queries need some work. Renaming results, ensuring tree and category ids make it into views, etc. Keep in mind that before this is all over many of these will be set into views, functions and stored procedures. That said, we've seen a lot structures which could give us some interesting properties. Since our queries can be performed on the category_id or the tree_id, our ability to find groupings of products would be extremely flexible.

Next up is the part that makes most people flinch... how to copy, insert and delete - then the real fun begins.

Part 3: What's under the hood?

Dec 22, 2006

One of the wonderfully frustrating things about my job is that I get to learn constantly. While I enjoy being able to learn, my lessons usually come from frustrating experiences. For instance, I learned that you need to use PHP's 'mysqli' function to connect to the database if you want to use a stored procedure. I learned this after about two and a half hours trying to make my first procedure work with PHP's 'mysql_connect.' I learned about PHP's 'ctype' and 'real_exception' functions about a week after wasting an evening building a bunch of functions to do the same thing in REGEX. I learned about the glory that is shell access to MySQL after months wasted time trying (and often failing) to push SQL files into databases through phpMyAdmin (if you've never had the experience... spare yourself.). On the MPTT project, I learned there is more to getting your table structures right than simply naming columns. So here's how I set up my tables for this project AFTER trying the wrong way. (twice).

Choosing a storage engine

I knew that MySQL utilized several different storage engines with many different features. Until now I defined this fact in my head as "use MyISAM if you need full text indexing and InnoDB for everything else." For this project I finally went and read the documentation on storage engines in the hopes of picking the best one off the bat (as opposed to my normal method of just using whatever is defaulted in MySQL). I had three main concerns going in:

  • Foreign Keys - Seriously, what is a relational database without a foreign key? If I'm not using foreign keys the database is only relational in my head. I've known for a while that you can't set a foreign key into a MyISAM table - hence why I don't use MyISAM often anymore (again, only when I need a full text index).

    With the MPTT project, one of the big fears would be orphaned information. If I can't assign foreign keys, I will have to build extra business logic to keep up. Since we're going to abstract the structure from the category information as I mentioned in my last post, foreign keys can be used for cascade updating and deleting of information. If not, we suddenly require a loop with an update statement in PHP (or a stored procedure in MySQL) to ensure that all instances of information falling under a given tree node are deleted and that the renumbering process is adjusted accordingly. In my tests (a five tier tree with 75 nodes and five categories which repeated at least twice) this approach was causing unduly long load times (27 years) during write operations.

  • Transactions - You can't perform a transaction on a MyISAM table. This is something I'm glad I found out before I spent an hour debugging under the assumption that the problem was due to a typing error. Because the tree traversal counts entirely on the integrity of the left and right values, it's a huge deal. If I start rewriting values and the process is interrupted, there is no way to roll back the values without a transaction. If that happens, I could end up having to reorder the tree nearly by hand. If the tree is small, maybe not a huge deal. But if we're talking a few hundred nodes (hell... a few dozen nodes) we're talking about an operation which would most certainly cut in on my bar hopping plans. In the words of The Dude, "This will not stand, ya know, this aggression will not stand, man."

    Long story short, I am all about the transactions now.

  • Locking Granularity - When you work with the database you have the option to lock either the table (MyISAM) or the row (InnoDB) depending on which storage engine you choose - this is locking granularity.

    Any time I add, move or delete a category I'll need to rewrite values on a large number of other records. In the case of a large hierarchy, the write process could take some time - and I don't want anyone reading from or (gasp) writing to the table until the process is finished. As such, the ability to do a table wide lock was originally high on the priority list.

    After getting into the project, however, I eventually decided that full table locking was a poor solution to my problem. After all, if you have an entire table locked for any amount of time on a high traffic site you'll degrade user performance. I work hard not to infringe the MySpace patent on crappy user experience for the masses.

So it's InnoDB

The astute reader will notice that I haven't even mentioned any of the other dozens of storage engines. There are a few reasons for this. First, I am trying to design a system here that is easily deployable. If I go with a storage engine that doesn't come with the prepackaged with the standard MySQL install, then deployability takes a hit. Second, my target use for this system is large web applications, but not so large that they approach the enterprise level. Most of the other storage engines that DO come standard with MySQL are designed to compete on some level with Oracle and trade off features for speed. I don't want to build any more PHP side business logic interfacing with the data layer than I have to... so these were out by default. That said, I hope to re-architect this project in the future for enterprise level applications - but that will most likely happen on an Oracle or MSSQL database which will give me the features and speed all at once.

All of that said, I am not completely thrilled with InnoDB. I wish the MySQL documentation did a better job of describing how InnoDB's row level locking is handled in a transaction. As of the time of writing I'm not sure if MySQL locks all the affected rows during a transaction or if it only locks on a row-by-row basis during the commit phases. None the less, the worst thing that I can conceive of happening would be for a tree to unravel and need to be reordered... and the easiest way I can see that happening would be if I go to renumbering nodes and something stopped the write from finishing. As such, the ability to roll back changes and handle errors is paramount in this project. With that in mind, InnoDB is a the obvious choice.

Choosing a character set

I don't think there is any argument that professional development requires scalability with I18n in mind. However, if I am married to the idea of PHP as my scripting language, I am left with a dilemma.

UTF-8 support in PHP 5 sucks.

Yea, I know. There are a million classes available that can handle multi-byte character strings out there... however none of them are particularly fast. This isn't a problem unless you plan on manipulating a metric butt-load of strings - but I kind of hope to. Normally I'd go with functionality over speed in this case, but PHP 6 is scheduled to come right out of the box with UTF-8 support built right in. I am willing to bet that the string functions baked into PHP 6 will run MUCH faster than anything I can add to the recipe now, so in the meantime I will develop on latin1 and plan a UTF-8 switch for whenever PHP 6 becomes available.

That said, there is absolutely no reason not to use UTF-8 as the character set in the database, so long as I filter my input to ensure that only Latin 1 characters make it in until PHP 6 comes out. From a planning perspective, this means that I need to abstract all of my non-binary safe PHP functions and set the character filtering on my inputs in such a way as to be able to turn off the ban on non-Latin characters in the future.

Enough already, Mac. . . write some code!

Normally I would give each record four pieces of meta data... who created it, who last modified it and when those two events occurred. For simplicity (re: because I am lazy), I will leave these columns out of the examples I show here.

First up is the tree table. We'll need, of course, columns for the left and right values for each node (keeping in mind that 'left' and 'right' are reserved words in MySQL). Since the category information will actually be held in a separate table, we'll need a column to hold the category id. There was a strong desire to set this as a foreign key to the category table, but I resisted. I don't want to drop a sub tree if a category gets deleted. It would be more useful to give the user the option to move the orphaned nodes instead. As such, we'll then set up the category id to allow nulls. This will prevent a completely orphaned branch in the event of an accidental node delete.

As such, my table looks something like this:

CREATE TABLE `tree` (
`tree_id` int( 11 ) NOT NULL AUTO_INCREMENT ,
`tree_left` int( 11 ) NOT NULL ,
`tree_right` int( 11 ) NOT NULL ,
`cate_id` int( 11 ) ,
PRIMARY KEY ( `tree_id` ),
INDEX(tree_left)
) ENGINE = INNODB DEFAULT CHARSET = utf8 COLLATE = utf8_unicode_ci;

We also need a table for the category information. This table only requires an auto-incrementing id and a text name for now. Obviously if you needed meta information about each category it could go here as well.

CREATE TABLE `category` (
`cate_id` int( 11 ) NOT NULL AUTO_INCREMENT ,
`cate_name` text NOT NULL ,
PRIMARY KEY ( `cate_id` )
) ENGINE = INNODB DEFAULT CHARSET = utf8 COLLATE = utf8_unicode_ci;

Initial Data

Ok, before we start, we need to create an entry to anchor the hierarchy in the tree table. This is because a pre-order tree traversal requires a single root element to encapsulate every node on the tree. I will assign it a category called "home" so that I can use it in breadcrumbs, however this entry wouldn't need to appear in the site.

For testing your code, you may also want to throw some test data. Just make sure to get your left and right values right when you add it in. You can use mine if you like.

Part 2: Search for a more flexible tree.

Oct 17, 2006

In my last entry we went over the what (I want a category management system that can scale out quickly, deeply, and allow me to perform common operations quickly on data that doesn't always fit in a nice neat little tree.) and the why (because the chicks dig it)… of an MPTT category system. Before we start dealing with how, I'd like to cover something I wanted to keep in mind throughout the project.

One of the biggest pains I have when dealing with website categories is in deciding what to do when a category repeats itself in the hierarchy. There is often no elegant way to add the same category twice into the same tree in a CMS. Often when our customers ask us for a star shaped cookie cutter we squish the data into a round shaped cookie cutter, call what we did IA, and feel proud of ourselves. As a developer, I constantly have to remind myself that "I didn't know how to create it" and "it'd be hard to accomplish" aren't always reasons a solution isn't worth developing.

Take organization charts. In the military I have, at times, served in up to ten positions on the same organization chart at once… sometimes on different tiers and often with varying categories of people underneath me (in at least two cases with people answering to me who out ranked me). Not an easy concept to automate mapping on a computer (I know I wouldn't want to do it) because in the real world people break rules.

Rule exceptions and random repetitiveness can be hard to plan for and generally the IA geeks will "filter it out" for the customer (defined as telling the customer it can't be done). None-the-less, any system we design should at least have the ability to take these situations into account on the off chance that the repetition makes sense.

Failed attempts

In practice with PHP, I used to pretty much stick to assigning a single parent id for every category in the system… or a NULL value if the category has no parent. It's slow and inflexible, but it's how the guys at pMachine do it… so how wrong can it be? (only a handful will get that joke)

I have tried several ways over the years to make this more flexible and more manageable business-logic-wise.

One idea is to split up my categories into subgroups. There would then be an SUV child in both the Truck and Family Car tree with separate ids. The ids were matched in a relationship table and I made sure the CMS only related content to one of the related ids (to reduce he chance of content repetition). Finally, I wrote business logic that looked for grouped categories every time content lists were called for and returned all the items which were related. The logic for this was pretty twisted and incredibly hard to maintain for larger sites... so I eventually gave up.

My second attempt was to try to not account for the tree structure in the data layer at all. The database held the category ids and I tried to maintain a flat file that held the hierarchy. New categories needed to be added to the flat file by hand. This wasn't a bad idea on the surface, but showing relational information became very difficult and I was very limited in what I could do.

With my last attempt I got a little smarter. I decided to abstract the category's identification and metadata from the representation of the hierarchal structure in the database. One table was just a list of category id numbers mapped to "slots" in a tree and a parent slot id.

CREATE TABLE `tree` (
`slot_id` int( 11 ) NOT NULL AUTO_INCREMENT ,
`parent_id` int( 11 ) NOT NULL ,
`cat_id` int( 11 ) NOT NULL )

CREATE TABLE `category` (
`cat_id` int( 11 ) NOT NULL AUTO_INCREMENT ,
`name` int( 11 ) NOT NULL ,
`description` int( 11 ) NOT NULL )

What I place in the containers no longer matters because we no longer have to track two separate unique category ids for the same category if it falls into the same tree. In other words, SUV can now have a unique category id and appear in as many branches of the same tree as I like… with as many combinations of children as I want, too.

The down side is that this arrangement required lot's of recursive queries to do simple tasks, such as building menus and Tim forbid breadcrumbs. You needed to query once for each parent category you needed and another for every id. I ended up only using it in my ASP development because at the time Microsoft AccessTM (your taxpayer dollar pays for a lot… but not high end website databases) had the ability to query against queries (in other words... it had views) and MySQL 4 didn't.

With MySQL 5, LAMP developers suddenly had views… which made combining the two tables much easier to handle. But I learned from my ASP version that I would just be replacing one slow system with twisted PHP business logic and too much IO for another.

Then came MPTT... But that, is another post.

Category management Part 1: The what and why

Oct 5, 2006

So I am going to develop a CMS from scratch. I know… crazy. I think I have some ideas that are a significant improvement over what's out there, however. But before I start getting fancy, I need to build the core pieces that will make up the system. First up is a category management system.

The idea for this piece came to me from Ryan, who had built a similar system in ASP years ago. What I am building is a modified pre-order tree traversal system to work over a modified nested tree with a non-dependent (because it's not part of the structure, just there for meta data) adjacency list model tacked on.

What?

I know. I was the same way before I read Joe Celko. Let me break this down a bit… but I'm going to work backward here.

Last is the adjacency list model. This one of the oldest computer hierarchial structures and probably how 90% of websites handle category management... including any site I have ever worked on. Think of it like bunch of boxes with lines between then… like the org chart at work. In a database you basically save the value of the parent node for every node on the tree. There is one major problem with this… pulling information about each node becomes not just difficult, but slow due to massive recursion… especially with large trees. Things like finding the path to root (breadcrumbs anyone?) and finding the children (submenus?) become a real drain on resources repeated for every page hit. Next term back is the modified nested set model. Think of it as a bunch of containers that fit inside of each other. Instead of saying item X is the parent of three categories, you say that three categories are contained by category X, which is contained by category Y, which is contained a single root node category z. In a database you basically assign left and right values to every node in the tree in a certain order.

The first and final term (hah!) is modified preorder tree traversal. All this defines is how we will crawl around the tree… and more importantly in this case how we will assign all of these left and right values. We're using preorder (as opposed to postorder or inorder) for two reasons. First, it defines a tighter hierarchical structure… suddenly I can define seniority by the position on the tree… removing the need to figure out sort order values (a pain if you've ever had to insert a lot of values between to node with sort values that are close together.) Second, it allows us move or delete entire branches of the tree move easily.

Look, I could go into detail about how that works and draw a bunch of diagrams, or save myself some time by pointing you to a tutorial online. The best answer, however, is for you to pick up Celko's book on the subject. It's geeky, but will give you far more information that I can fit here... and that's really where I learned everything I know anyway ;)

Benefits

First, I can complete fairly complicated traditionlaly recursive procedures on my data entirely in the data layer. This is important because there are two relatively complicated pieces of logic I am going to include on just about every dynamic page I have to create – building menus and piecing together breadcrumbs. Anything I can do to make this procedure run faster or more efficiently will bring back gains… and in sites with huge category systems, the gains will be HUGE.

Second, in an adjacency list model the number of tiers in your tree are limited to the business logic you bake. If you want to add another tier, you need to add another recursive layer. With nested sets, you can throw as many tiers on as you want with no changes in logic. Now I don't have to explain to my clients during QA why they can't go ten tiers deep.

Finally, parent child relationships are set for the entire depth of the tree. If you have a branch or leaf and want to know if it is a child of a parent three tiers up, you're doing three recursive two join queries to get the information with adjacency list model. With Nested sets you perform a simple equation which can be done as easily in your business logic layer as it can in your data layer.

The down side

Ah, but I am always looking at the trade offs, right? The arrangement is far from perfect, hence why we've modified a few things in our structure.

First, as everyone seems to point out, the system is hard to write. I know programmers are supposed to be lazy, but I am an efficiency nut and willing to put in the elbow grease. Several people have done the work for me already, but it was written in PHP business logic instead of SQL queries which could be abstracted to the database layer (MySQL didn't have stored procedures or views when a lot of them were originaly written). As such, I'm taking my entire structure from Celko's book.

Second, read functions are MUCH faster than write functions in this system. As such, while this is not something you are going to want to use if you will be creating a lot of new categories. However website structure doesn't change very often, so this problem really doesn't confront me. But if I build a newsgroup module or something where I'll be writing new categories all willy-nilly, I'll want to use a different structure.

Finally, any time you add or subtract from the tree, you need to rewrite all the left right values to the left of the node you changed. You have to be careful to make sure the operation finishes without interruption or you will break your tree… and that will be a pain to fix.

Up next?

Creating the table structure.

Gone Fishin'

Sept 24, 2006

So I recently moved to Secane - which has been great being as how I don't call my parent's living room home anymore. Only problem is that apparentl it is easier (and cheaper) to catch an internet connection in Third World Africa than Secane, PA. Just a quick update and we'll slide right back into play.

I have got to find a better solution

The the most recent site I developed launched. The design was by Josh Lane, who also happens to run "Animated Bliss." Honestly he did a good share of the development on the project, only comming to me when he ran into the wall that is ExpressionEngine... the CMS used for the project.

From a development standpoint it was a terriffic lesson on the difficulties of forcing a canned content management system to serve a customized site. For those not familiar, EE is an open source PHP-based system by pMachine. It handles the blog concept pretty well and has a fairly well thought out hierachial category system and a propritary script templating system which allows a designer to access some basic scripting language tools (such as loops, database access and environment variables) without learning php. It is designed to be able to create and mesh multible blogs into a single feed and does a good job on simple sites.

Obviously, there are some applications a program like this works well for. I would have loved to know about it when I was out at CJTF-HOA. We could have set it up and let unit representatives blog their little hearts out... each on their own unit blog under the umbrella of the CJTF-HOA website. Categories assigned to each article would then mesh accross the site, so if someone wrote about a school dedication from HMH-464, it would be linked by category to similar articles from other bloggers. Mesh that in with the official PAO website and news feed, and you have a very interesting tool.

But there are some severe limitations to EE that are worth mentioning. First, the business application side of the CMS can be a little frustrating for the less computer savvy. Controls are not very intuitive and if there is a way to lock contributors out from being able to mess with the templates or data structure of the site, I havn't been able to find it yet. Every Site I've seen deployed with the system has required a significant amount of training for the client to be bale to use the product... and I'd be a lot happier with something a bit more intuitive usability-wise.

The larger problem for me is that my designers are creative. EE does not play well complicated page layouts or customized database queries. An example from my most recent project involved search features Josh came up allowing the end user to search for services by topic and by county or counties. For instance, "all sevices 'X' in Delaware or Montgomery county or Chester counties"... or any combination. No problem programatically... just write the query on the fly. But EE doesn't allow SQL queries with 'ORDER BY' or 'GROUP BY' statements. Also, it doesn't play well with JOINs.. and I hate writing multiple table WHERE statements.

Another large problem we ran into is in how EE saves data. Namely, all of your core data into one table. It caused us to have to work some pretty complicated joins to piece everything back together... not a great thing - performance wise - when you're sifting through a few thousand entries.

The answer we settled on was a combination of using MySQL table views and writing most of the display template in PHP. This ended up working in the end... but it leaves me wondering what the point of using a CMS is if I have to write the display tier logic myself?

Microformats

Sept 3, 2006

the deal

You can't research current trends in sharing information on the internet without coming across the concept of Microformats. I first heard about them from Joe, an intern at E-Ink from Carnegie Mellon University. It was one of those conversations where you nod intelligently at the time as though you have a clue and then run home to read up after the fact.

Wikipedia defines Microformats as:

"markup that allow expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a standard web page that is marked up with microformats."

Basically, Microformats give you the ability to define the nature of information in your documents. That way, instead of just seeing and passing through random text, your computer can begin to make sense of the categories and definitions of the data being used. Things like "this is a family name" or "this is a person and she works at this company."

XML does a descent job of this, however XML files are not necessarily human readable documents. (Well… they can be… but when you set your definitions you lose the semantic nature of the resulting markup.) Microformats allow you to keep this meta information in your final HTML document. Nice thing is that the definition is more loosely defined than XML… placing the information in attributes instead of the elements themselves. You can declare a given-name in a span, div or anchor tag… vice XML where your usage is constricted by a strict DTD.

The problem

Despite the fact that I have definitely bought into the concept, I ran into a few problems implementing. First, the microformats website is poorly organized and… while they tell you (in geek terms) what the definitions are… how they are being used is not handled so well. (well, honestly it might be… but I was in search of a better source of information after about ten minutes.) I found myself continually asking "why do I care if my markup is in human readable form?" In the end I learned that wasn't the point at all... but only after I started reading other resources.

Another potential problem I see is that some functionality or various specifications overlap. For instance, when should I use rel-nofollow vs VoteLinks… or when should I use hCard vs XFN. I think I've found my answers for this site… but I would love to see more about how they intended the various specifications to work together in the same document.

What's implemented

  • hCard – I am using hCard to provide contact information for myself and others mentioned on the site. For the most part, just websites… since it's an unfortunate fact that spammers will suck the system dry for e-mail addresses if given half a chance. (Shame the government doesn't fine people for being assholes in general.) If you look, you'll notice divs in the markup containing hCards are in there now. I'll use CSS to hide them on screen and when printed… but if you are running tails for firefox or a similar program, it will register the links.
  • rel-license – Will be attached to all of my license links to help identify them.
  • rel-nofollow – Once comments are set up, any links provided by non-recognized users will have no-follow applied by default. I consider this to be a great anti-spamming tool if the major search engines will be willing to buy in. At the moment I am taking a "build it and they will come" stance on the specification.
  • rel-tag – Will be used once I have the tagging system set up. As I plan to use my category management system in a similar fashion, I may also place this tag on my entries until the rel-directory specification matures.
  • XFN – Probably the corniest specification I'll be using… but since my end goal is to integrate this with all the other social aps I use… I figure this might be useful information later on. It allows you to mark people by what kind of contact they are. (So, the browser will know that MoBones is my sibling and Mark is a friend that I have met.)
  • XMDP – Probably the microformat most people have seen and used without knowing what they were using… it specifies a bunch of meta tags to use for your documents. I have used meta author, keywords, copyright, date, identifier and script before… but reading the specification taught me a few things… like dates are supposed to be stored in ISO 8601 format (see the abbreviation text that pops up if you scroll over one of my dates.)

Not being used… yet

There are also a number of other specifications I am curious about using… but since they are not mature I am holding off until they are "blessed." Geo seems like it would be interesting… especially if I started geo tagging my photos in flickr… allowing me to show pictures taken near something I wrote about. hResume would be great and might make a cool addition to hCard in a professional portfolio. Rel-directory would solve my confliction over the use of the rel-tag attribute… giving me the ability to label a directory link. I may also use xFolk and/or VoteLinks… but I need to consider how I would implement them in a CMS before I get too excited.

Wanted: tools

So, what does all this mean? Not much, really. I have most of these microformats in use now. But how they will help out you, my dear readers, I have no idea. tails is the only program I know of which makes use of the technology… and it's implementation isn't really all that useful. Basically, it gives me a link to a person's webpage… kind of goofy since that information comes from an anchor tag… which a link to that person's webpage.

What I would like to see is browsers or delicious allowing a user to bookmark all the hCards off of a page into your browser or even Outlook. Or a relationship mapping program that could worm through blogs using XFN and build a "family tree of the internet." Having search engines pay attention to VoteLinks and change site rankings accordingly would be terrific… and get me working those into my site more rapidly.

But until more useful tools come out to take advantage of the new technology, I'll continue using it… and maybe building functionality into my CMS to take advantage of the extra data.

But that, is another post.

Layout

August 30, 2006

Picking a design

Those of you paying attention may have noticed me playing with layouts and graphics over the last few days on Bittermonk. The experience has taught me a few things.

The first thing is that I am not a designer. I went to school for print design. I am a hell of a photographer (in my opinion, at least). But when it comes to designing a great look and feel on a website, I suck.

Second is, that it really doesn't matter at the moment- especially for what I am trying to do here. As I developed my goals for the blog and the content management system behind it, I realized that fancy looking graphics would only take away from the focus.

Well, that and I suck at graphics.

ANYway, there is plenty of resources on the "three layers of separation" in web design... however all three layers exist for the data they present. My focus, for now at least, is on the data layer.

This makes sense... since I suck at design.

So, what you see here is what you'll get for a while. A couple notes.

Design features

  • XHTML 1.0 Strict - I decided to go with XHTML 1.0 strict. Since I plan on merging a lot of data sources into this website, I wanted the most flexible datatype possible. All efforts will be made to maintain a validating site no matter what data I incorporate.
  • Separation - Fear not, standards geeks... using XHTML strict ensures I have to separate behavior and presentation from structure.
  • Fluid layout - I am doing everything I can to keep the CSS as fluid as possible. You'd think that would be easy... being as how I have a text only layout at the moment... but differences in the ways IE, Mozilla and Safari proved to be a major pain.
  • Data Islands
    • All navigation will be on the left side of the screen.
    • Links to supporting information will be on the right.
    • Think of the footer as the credits. It will be display links to websites associated with people involved with the blog. If you are recognized by the CMS and are mentioned in an article or leave a comment or otherwise contribute content, your links will appear down there. (Don't worry, you get to pick which ones.)

Up Next

Developing a next generation blogging CMS from scratch is an ambitious goal. If it were an easy task, there wouldn't be so many developers with Word Press and Expression Engine Blogs. I figure the first step, however, is to create a quick database interface to tide me over until the core functions of the CMS are completed. Most important features? Handling comments and XML.

But that, is another post.

Reset

August 27, 2006

And with that, the site changed. For those of you who are reading through RSS, if you came over to the site you'd see that the old content has been stripped and the design dropped. Clean slate. Maybe you're wondering why?

In the past this blog was for me. It was a place for me to communicate with family and friends during travels and store information and links I was interested in. However, two things have changed for me recently - I went from being a roaming person often far from home to putting down roots in Philly and I went from being chiefly interested in content on the internet to how that content is delivered. I'm now going to blog more for the outside world.

Why the sudden interest in content delivery? I blame Tim... who got me started at Electronic Ink's Development Team. I've been doing development for some time now... but this is the first time that it was my job to be more concerned with delivery than the content itself. I'm hoping to use this blog to keep me mindful of what content creation entails as well as give me a platform to share what I am learning.

I suppose more so than a simple change in career - the move to Philly is prompting me to change what audience I am focusing on. The plain fact is that the majority of my current readership is here... and I'd much rather gripe about the government or pontificate about life over a glass of wine than write about it.

Long and short, changes will abound as I begin to build the CMS that will drive this site and complete a design. In the meantime, if you have any suggestions, let me know.

Brian E McElaney
Mac
Web Technologist
Electronic Ink
WORK
One South Broad Street 19th Floor
Philadelphia, PA 19079
Tim Crowe
Web Technologist
Joshua Lane
Designer
photo Electronic Ink
Digital design firm