Lightdesktop Blog

Quick posts and comments and thoughts about development of the LightDesktop system.

Wed, Apr 18th

Next Steps

So a few things to mention here - first off, the select()-based rewrite has made some great strides - plain files work fine, directories are up next. There are already a few simple caching mechanisms which ought to help make things run fast. Still more to do - directories, retries, symlinks, impossible-file detection - before it’s ready for primetime.

And that brings me to my next point. All the time I’ve spent working on the web-based filesystem is time I have not spent on the real ‘meat’ of the system - the apps and utilities and window manager and so on. And I am willing to bet that some people just want to use LightDesktop without the nifty web-backed filesystem - they just want something small and responsive to play with.

So what I am thinking I will do - and am nearly 100% decided on - is to give LightDesktop a new ‘build target’ - a full CD or DVD install that has everything in it. While it will *have* the latest version of CREST-fs on it, it won’t be using it by default. This should let me separately develop the apps for the system, the system itself, and the filesystem that powers it all, all at the same time and all independently. It means updates to such a system will be unpleasant - probably a wipe and reinstall - but it also means one can just use the system without buying in to the whole webby cloudy business. Maybe I can even come up with some kind of clever way of ‘checking for updates’ by mounting CREST-fs and running some cron jobs? That would be a bonus.

I’m hoping this will just necessitate a new set of targets for the build scripts and some clever command-line scripting - and more hard drive space here and there. Ideally, that would be all. I’ll continue to investigate and keep you all informed.

Mon, Jan 2nd

More fun with CREST-fs

So I’ve always known that at some point I was going to have to switch CREST-fs to using a select()-loop. Right now, resources are sometimes fetched from the web and sometimes from disk, all running multithreaded. Each thread blocks while it waits to read from the net, and blocks again while it waits to write to disk. With multiple threads doing this at the same time, you can get conflicting access to the same resource - say, one attempt to stat() a path and another to read its contents - which means multiple simultaneous accesses to the net and disk for the same thing.

Trying to code all of this to work in a multithreaded environment has been, to say the least, a nightmare. And getting it to perform well hasn’t gone well either - no matter how I optimize, it doesn’t work right. And heaven forbid the network is actually down - then all bets are off. This ruins disconnected mode.

So, I’ve started to write a version of CREST-fs that uses a select()-loop in a worker fork() that reads requests from a unix domain socket, and then writes responses to it. Luckily, I’ve been working with Node.js a lot lately for work. So I am starting to get comfortable with this style of development.
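To make the shape of that worker concrete, here is a minimal sketch in Python rather than the C that CREST-fs itself is written in. Everything in it is an assumption for illustration: the newline-delimited request format, the `handle_request` stub, and the names `worker_loop` and `start_worker` are all hypothetical, and a real worker would hand select() many descriptors (network sockets as well as the request socket), not just one.

```python
import os
import select
import socket


def handle_request(path):
    # Hypothetical stub: a real worker would consult its cache or the network.
    return "OK " + path


def worker_loop(sock):
    """Answer newline-delimited requests on sock until the peer closes it."""
    buffer = b""
    while True:
        # Sleep in select() until the request socket has data to read.
        readable, _, _ = select.select([sock], [], [])
        if sock not in readable:
            continue
        chunk = sock.recv(4096)
        if not chunk:  # an empty read means the peer closed the socket
            return
        buffer += chunk
        # Answer every complete request line currently in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            sock.sendall(handle_request(line.decode()).encode() + b"\n")


def start_worker():
    """Fork a worker; return its pid and the parent's end of the socket pair."""
    parent_end, child_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:
        # Child process: serve requests, then exit without running parent cleanup.
        parent_end.close()
        try:
            worker_loop(child_end)
        finally:
            os._exit(0)
    child_end.close()
    return pid, parent_end
```

The parent writes a path followed by a newline to its end of the socket and reads one line back; closing that socket is what tells the worker to exit.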

Eventually, once I have a select()-based version of this working, I may move from the FUSE high-level library interface - which is based on multiple worker threads for each filesystem - to the low-level interface, which is asynchronous and fits well into the select() model. But that won’t be for a while. Right now, I’m just getting the worker to be able to receive requests, write responses - and, as a bonus, I’ve built in a caching system for metadata which should reduce disk I/O to some extent.

It’s going well. I’ve got it running in parallel to the existing system for now, but I intend to migrate all disk I/O and network I/O to this worker process over time. I’m using some of the previously-mentioned testing methodologies to isolate bugs and work through them, and I’m making good progress. I think my next step will be to get the network I/O migrated over, then remove the ‘training wheels’ to see how things work. I will also be using a DNS library like c-ares or ares to do asynchronous DNS lookups, so we will be able to keep from doing any blocking in the worker process. So far, the only blocking tasks I have in the worker process have to do with stat()’ing a filesystem object. I’m hoping to keep those to a minimum. And I *really* hope that I don’t have to move all of the stat() activity to some kind of thread-subsystem. I’m also considering changing some of the mechanics of the cache subsystem to not depend on stat() mtimes so that I don’t have to do stat()’s in the first place.

So it’s quite a lot of work ahead of me, but I’m really looking forward to the results. And I’m still holding on to my previous branches of code so that I can go back if it all messes up.

Anyways, just another update on where we’re at for today. Quite a few steps backwards, to be honest, but I’m excited about where this will leave us for moving forward again!

-B.

Sun, Dec 18th

What’s coming up

As I’ve mentioned, I’m working on improving the filesystem, CREST-fs (which is available at GitHub - https://github.com/uberbrady/CREST-fs . If you’re interested, check out the 1_7_5 branch, that’s where all the new fun lives).

First off, I’ve made some improvements to the server-side, reducing the number of calls to RS Cloud Files for each entry. This has obvious advantages, of course.

Since RS Cloud Files occasionally will fail on some requests for no good reason, I’ve modified both the server-side and client-side to be more tolerant of 500-series server errors. On the server side, any Exception that gets thrown will generate an “HTTP/1.1 500” -style error. On the client-side, CREST-fs will now retry - up to five times - a URL that comes back with an HTTP status of 500. After the fifth failure, it will give up and should throw an I/O error.
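The client-side retry rule can be sketched roughly like this. It is an illustration in Python, not the actual CREST-fs code: the `fetch` callable, which returns a `(status, body)` pair, is a hypothetical stand-in for the real HTTP layer, and I am assuming that "up to five times" means five attempts in total and that any 5xx status counts as retryable.

```python
import errno


def fetch_with_retry(url, fetch, max_attempts=5):
    """Call fetch(url) until it returns a non-5xx status or attempts run out.

    fetch is a hypothetical callable returning a (status, body) tuple.
    """
    for _ in range(max_attempts):
        status, body = fetch(url)
        if not 500 <= status <= 599:
            # Anything other than a 500-series error is passed to the caller.
            return status, body
    # Every attempt came back with a 500-series error: report an I/O error.
    raise OSError(errno.EIO, "giving up after %d attempts" % max_attempts, url)
```

Passing the fetch function in as a parameter keeps the retry policy separate from the transport, which also makes it easy to exercise with a deliberately flaky fake.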

The performance improvement I’m most excited about is something called ‘leaf-optimization’. A directory with a whole bunch of descendants - 

a/b/c

a/d/e

a/f/g

a/g/h

Up to around a thousand or so entries - can be completely refreshed with a single GET call for a/. In the common case - where you’ve already cached those files - one GET can save you hundreds of HEAD and GET requests!
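As a rough illustration of why this saves so many requests, here is a sketch in Python. It rests on an assumption this post does not spell out: that the single response for a/ carries an Etag for each descendant. The `listing` and `cache` structures and the name `refresh_from_listing` are hypothetical, not the real CREST-fs data structures.

```python
def refresh_from_listing(cache, listing):
    """Mark cached descendants fresh using one directory listing.

    cache:   dict mapping path -> {"etag": str, "fresh": bool}  (hypothetical)
    listing: dict mapping path -> etag, as assumed to come from one GET of a/
    Returns the sorted list of paths that still need their own GET.
    """
    stale = []
    for path, etag in listing.items():
        entry = cache.get(path)
        if entry is not None and entry["etag"] == etag:
            # Cached copy matches the listing: no HEAD or GET needed for it.
            entry["fresh"] = True
        else:
            # Missing from the cache or changed on the server: fetch it later.
            stale.append(path)
    return sorted(stale)
```

In the common case where everything is already cached and unchanged, the returned list is empty and the one GET for a/ is the only request made.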

Coming up on the improvements side is work on the read-write filesystem - asynchronous directory creation and file deletion. This is still alpha, and I’m still working out the kinks. I’m trying to make sure that read-side performance and stability are top-notch before I focus too much on read-write.

Probably most important, though far less exciting, is a change of methodology. I now have test rigs for fetching a single resource (relative to the cache), or checking its “freshness”. This makes it easier to test the individual changes and optimizations I’ve been making. Next, I’ve stolen some of Netflix’s ideas about designing for cloud failure - Netflix has Chaos Monkey, which causes individual nodes to fail. So on the development version of LightDesktop, I now have Chaos Marmoset and Chaos Pygmy Marmoset. Chaos Marmoset makes the server randomly (currently, 1 out of 5 times) return a 500 error for a request for an individual URL. Chaos Pygmy Marmoset makes the HTTP connection to the server randomly close around 1 out of 10 times. This will ensure that the client and server are prepared for the odd resource to not be fetched properly.
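The Chaos Marmoset idea can be sketched as a thin wrapper around a request handler. This is an illustrative Python version, not the server’s actual code: the `handler` signature (a URL in, a `(status, body)` pair out) and the injectable `rng` parameter are assumptions made so the behaviour can be tested deterministically.

```python
import random


def chaos_marmoset(handler, failure_rate=0.2, rng=random.random):
    """Wrap handler so that roughly failure_rate of requests return a 500.

    handler is a hypothetical callable: url -> (status, body).
    rng returns a float in [0, 1); it is injectable so tests can control it.
    """
    def wrapped(url):
        if rng() < failure_rate:
            # Simulated server failure, returned without calling the handler.
            return 500, b"Chaos Marmoset"
        return handler(url)
    return wrapped
```

A Chaos Pygmy Marmoset would look much the same, except that it would drop the connection instead of returning a status, with the rate set to about 0.1.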

The last change is that I’m trying to insulate my Mac OS-specific changes to one Makefile, my Linux-specific ones to another, and my LightDesktop-specific ones to another, keeping the common Make stuff in one Grand Unified Makefile. This should help as I move back and forth across the various platforms I develop with and use.

Right now, the read-only stuff is getting very, very stable. I have one bug I haven’t figured out quite how to reproduce, where my stress-test (find-and-md5 every single file on newfs.lightdesktop.com from an empty cache) will just start reporting “File not found” repeatedly and exit after a long while. There’s also a slight nuisance with how it’s handling symlinks (it keeps refetching them unnecessarily). I want to drive these ‘known’ bugs to zero before I start adding new ones with the read-write stuff. Then there’s only the ‘unknown’ bugs…

(PS, and yes, I know that a Pygmy Marmoset is a subset of the class ‘Marmoset.’ I just can’t come up with cleverer names.)

Mon, Dec 12th

Back at work

It’s been a while since I’ve updated anything - just wanted to let everyone know I’m back at it. Got some ideas about how to deal with some filesystem flakiness issues - but right now I’m just re-acquainting myself with my code.

Hopefully more to follow soon.

-B.

Wed, Dec 29th

Server-side Filesystem improvements

We’ve just finished development on some improvements to the server-side ‘anonymous’ filesystem. So far, this only is in place for the experimental release of LD, but once any bugs are shaken out and repaired, this code will be put into production on the Live production filesystem, and the live production authorized-only storage system.

Further details: The Rackspace Cloud Files API that is available was not as efficient at accessing resources as one might hope. We were able to reduce the number of requests made to access a file or directory from 6-8 down to 1-3. Most of the common cases (requesting a file, or requesting a directory, usually with Etags that match) have been optimized to use 1 request.

We’ll make another announcement when the experimental filesystem code goes into production.


More Reliable Installation Experience

We have modified how the LD installer attempts to install the MBR (Master Boot Record). Previously, some partitioning actions may have damaged the MBR that was being placed on the boot partition, in some cases preventing boot of the newly installed system. This should no longer be the case.

Other changes are necessary here for the future. For example, something that *ensures* that you’ve picked the LD partition as ‘active’ would be extremely helpful, and perhaps a graphical partition manager (something like QtParted?) would be nice. Enabling power users to elect *not* to have their MBRs overwritten might also be a good step, as would transitioning from a command-line prompted script to a GUI installer. All of these would be nice to have, but other priorities come first, of course.

In the meantime, if anyone has trouble with the installer (*and* you’ve actually picked your partition to be ‘active’) - please let us know.

Thu, Dec 2nd

globetrotterdk asked: I have noticed that you installed Lightdesktop on an Eee PC. I have an Eee PC 701 I would like to install LD on. Can you post instructions for getting the wireless working?

That sounds like a good idea; I’ll try to do that sometime this weekend.

Fri, Oct 22nd

LightDesktop Release

The Development release of LightDesktop has now been promoted to the production release. This marks two new goals: #1) LightDesktop is now completely self-hosted. Any work I do on either the Development or Production versions will be done from a LightDesktop machine (or VM) itself. And #2) The LightDesktop filesystem (fs.lightdesktop.com) is now hosted on Rackspace Cloud Files, and not from a conventional Linux filesystem any longer.

The initial switchover to the new hosting should be transparent. I did a test transition and it went completely smoothly. The new hosting means that all of the checksums for all of the data on fs.lightdesktop.com are now invalid, so individual computers may be sluggish as they re-cache whatever data they need from the new server setup. In all probability, no other transitions should cause such a re-caching, as the new “Etag” scheme uses cryptographic checksums and can be replicated in any hosting platform without causing further re-caching events.
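To show why a checksum-based Etag survives a move between hosts, here is a minimal sketch in Python. The choice of SHA-256 and the function name `content_etag` are assumptions for illustration only; this post does not say which cryptographic checksum the real scheme uses.

```python
import hashlib


def content_etag(data):
    """Derive a quoted Etag purely from the bytes of the resource.

    SHA-256 is an assumed hash, chosen only for illustration.
    """
    return '"' + hashlib.sha256(data).hexdigest() + '"'
```

Because the value depends only on the content, and not on inode numbers, timestamps or anything else tied to a particular server, any hosting platform that serves the same bytes produces the same Etag, so cached copies stay valid.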

Sat, Sep 25th

Final (fingers crossed?) dev release

I’ve spun one more development disk image to fix a minor issue with the ‘wireless’ command from one of the menus which seemed to not be triggering properly.

If that’s the last of the issues, this will become the production release - and we’ll get a chance to see what happens to the server when a release changes…

Wed, Sep 22nd

New Dev release

Made some tweaks to the Dev release - nothing really substantial, just some bug fixes with the installer, updater, and the ‘emergency terminal’ that goes on one of the VTs in case the filesystem breaks. This is really groundwork for being able to spin a Dev release or a production release on a whim, which will be important.

Everything else pretty much works how I want it to. Once I’ve run this one through the wringer, I think we’ll be ready to make this into the Production one, and start a new Development branch - where I can go a little more nuts with making changes.
