I have a PHP/MySQL-powered website running on CentOS, a Linux distribution. Recently, we started getting this error…

Warning: mkdir(…) [function.mkdir]: Too many links in … on line … 

(I replaced the paths and line number with …)

Here’s the only relevant and useful page I could find on the problem: ext3: too many links. As you can see, the problem is that the ext2 and ext3 filesystems put a hard limit on the number of subdirectories you can create within a directory. The limit is about 32,000 (the link count is capped at 32,000 and each subdirectory counts as a link to its parent, so 31,998 subdirectories fit), which matches what I observed, too.
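To see how close a directory is to that ceiling, something like the following could work. It is only a sketch: `EXT3_LINK_MAX`, `countSubdirectories()` and `hasRoomForSubdirectory()` are names I made up for illustration, and the 32,000 figure is the ext3 link limit, which other filesystems do not share.

```php
<?php
// Assumed ext3 limit on links per inode. A directory spends two of them
// on "." and on its entry in the parent, leaving 31,998 for subdirectories.
const EXT3_LINK_MAX = 32000;

// Hypothetical helper: counts the immediate subdirectories of $dir.
function countSubdirectories(string $dir): int
{
    $entries = scandir($dir);
    if ($entries === false) {
        throw new RuntimeException("Cannot read directory: $dir");
    }
    $count = 0;
    foreach ($entries as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $dir . DIRECTORY_SEPARATOR . $entry;
        // Symlinks to directories do not add a link to $dir, so skip them.
        if (is_dir($path) && !is_link($path)) {
            $count++;
        }
    }
    return $count;
}

// Hypothetical helper: true while one more subdirectory still fits.
function hasRoomForSubdirectory(string $dir): bool
{
    return countSubdirectories($dir) < EXT3_LINK_MAX - 2;
}
```

Scanning a directory with tens of thousands of entries is slow, so a check like this suits a one-off audit better than a call before every mkdir().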

A commenter, Topbit, claims that searching through a large directory costs some 10-15% in time. He split his into a 100×100×100 hierarchy, which sounds pretty smart to me, so I’m doing the same thing. But I had to figure out how to structure this hierarchy, so I ran some experiments on a spare box.

Each folder is named after a unique identifier string. Let’s call this the “name”.

First, I tried taking the first 2 characters of the name, 2 levels deep. With 2 characters, you can have up to 26×26 = 676 folders in each level (assuming lowercase letters only). But it also means dealing with the overhead of calling a function (in PHP) and using substr.

A better method I decided on is to use just 1 character of the name per level, again 2 levels deep. Surprisingly, this 2-level-deep policy appears to work remarkably well. Here’s why.

With 1 character per level, 2 levels deep, I was planning to have paths like:

/a/p/ple - if the name is “apple”
/o/r/ange - if the name is “orange”
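The mapping above can be sketched in a few lines of PHP. `prefixPath()` is a name I picked for illustration, and the sketch assumes names of at least three single-byte characters:

```php
<?php
// Hypothetical helper: "apple" -> "/a/p/ple".
// Assumes $name has at least three single-byte characters.
function prefixPath(string $name): string
{
    if (strlen($name) < 3) {
        throw new InvalidArgumentException("Name too short: $name");
    }
    // First character is level 1, second character is level 2,
    // and the remainder of the name is the folder itself.
    return '/' . $name[0] . '/' . $name[1] . '/' . substr($name, 2);
}
```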

…And so on. That’s fine, but a disproportionate number of the names start with “th”: 738 of them, in fact. So it’s not, by any means, an even distribution. That might be OK, but if we wanted to, say, back up these files in 2 separate places… it would be very hard to find a halfway point that would work well. Plus, a lot of folders will have just 1 subfolder… while others have 400+. Doesn’t sound too good to me. Average: 26.99 folders in the third level, standard deviation: 60.95.
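For anyone who wants to run the same kind of numbers on their own names, here is a rough sketch. `bucketSizes()` and `meanAndStdDev()` are made-up names, and it uses the population standard deviation, which is an assumption about how figures like the ones above would be computed:

```php
<?php
// Groups names by their first $length characters and counts each group.
function bucketSizes(array $names, int $length = 2): array
{
    $sizes = [];
    foreach ($names as $name) {
        $key = substr($name, 0, $length);
        $sizes[$key] = ($sizes[$key] ?? 0) + 1;
    }
    return $sizes;
}

// Returns [mean, population standard deviation] of the bucket sizes.
function meanAndStdDev(array $sizes): array
{
    $n = count($sizes);
    if ($n === 0) {
        throw new InvalidArgumentException('No buckets to measure');
    }
    $mean = array_sum($sizes) / $n;
    $sumOfSquares = 0.0;
    foreach ($sizes as $size) {
        $sumOfSquares += ($size - $mean) ** 2;
    }
    return [$mean, sqrt($sumOfSquares / $n)];
}
```

A large standard deviation relative to the mean is the sign of a lopsided tree: a few crowded folders and many nearly empty ones.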

So I looked into hashing the names. A very simple hash function is md5(), so I tried that. The distribution is much more even, but there is some overhead: while running the above experiment took just 0.9 seconds, this one took 22.1 seconds - and the only difference is running md5() on every name.

But we get better stats: an average of 99.5 folders on the third level, which is higher, but with a standard deviation of just 12.5. Sounds better to me. Of course, the trade-off is that we now have to compute the md5 hash. Since this is only going to be computed once, I think this is OK. The other downside is that we cannot immediately see where any given name’s folder is. We have to md5 it first. But when would we actually need to know this? I’m not convinced it’s necessary.
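A minimal sketch of the md5 variant, assuming one hex character of the hash per level and the full name as the third-level folder. `hashedPath()` is a hypothetical name, and the exact split is an assumption, not a record of the experiment above:

```php
<?php
// Hypothetical helper: places $name under two levels taken from its md5 hash.
// Assumption: one hex character per level, so at most 16 x 16 = 256 buckets.
function hashedPath(string $name): string
{
    $hash = md5($name); // 32 lowercase hexadecimal characters
    return '/' . $hash[0] . '/' . $hash[1] . '/' . $name;
}
```

To create the folder, `mkdir($base . hashedPath($name), 0755, true)` builds the intermediate levels in one call thanks to the recursive flag; `$base` here stands for whatever root directory holds the tree.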

Still, this is a really difficult decision to make. After writing this post, I’m now leaning towards simply using the first 2 letters… because this makes it incredibly easy to go back-and-forth. The downside is inconsistency in the number of subfolders, but that is acceptable because even 738 is a manageable number… no, it’s quite large still, and diving into the folders when so many of them have only 1 within them sounds very annoying to me. I will be going the md5 route.

This entry was posted on Wednesday, October 17th, 2007 at 2:10 am and is filed under Web Development.

3 Responses to “Warning: mkdir [function.mkdir]: Too many links in … on line …”

  1. Topbit on October 17th, 2007 at 4:30 pm

    Clarification: The 10-15% referred to the Linux IO Wait time (visible in `top`) caused by having to traverse a list of thousands of names.

    For this particular application, our files were already relatively random, a long string of digits. We took off the extension (.jpg) and then used the last four characters (being the most ‘random’), in two groups - so 123456(78)(90) mapped to a file ./90/78/1234567890.jpg.
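The mapping described in this comment could look roughly like this in PHP. `tailShardPath()` is a made-up name, and the sketch assumes the part before the extension is at least four characters long:

```php
<?php
// Hypothetical helper: "1234567890.jpg" -> "./90/78/1234567890.jpg".
function tailShardPath(string $filename): string
{
    // File name without its extension, e.g. "1234567890".
    $stem = pathinfo($filename, PATHINFO_FILENAME);
    if (strlen($stem) < 4) {
        throw new InvalidArgumentException("Name too short: $filename");
    }
    $last = substr($stem, -2);    // final two characters: first level
    $prev = substr($stem, -4, 2); // the two before them: second level
    return './' . $last . '/' . $prev . '/' . $filename;
}
```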

  2. Kenny on October 21st, 2007 at 10:57 am

    Check out this post from 37 signals on how they implement something similar: http://www.37signals.com/svn/archives2/id_partitioning.php

  3. Ryan on October 21st, 2007 at 9:52 pm

    I use the folder hash method with sha1. Speed was not really a concern since this hash is only computed once, and I never needed it again. Right now, I only take the first letter of the hash, but I think I’m going to switch to the first and second.
