The Secret Behind MediaTemple's Grid-Server
22 Oct '06 - 06:14 by benr
There has been a lot of buzz around (mt) MediaTemple's latest offering this week: (gs) Grid Server. I listened to a podcast at TechCrunch and was really sucked into the marketing speak about the offering. But as a SysAdmin I wanted to know how it worked. The key to the product is that you set up your environment once and it's "automatically deployed on the grid", so that even your little site benefits from the collective resources of the grid.
I had to know how it worked. I called MediaTemple but they wouldn't tell me anything... frankly I don't think the guy I talked to even knew. So I bought an account to look for myself and found something very, very interesting. The real story isn't MediaTemple's Grid Server, it's actually BlueArc.
The secret behind GS isn't revolutionary, but it is clever. Basically they have, at last check, 17 systems running Debian. Thanks to a BlueArc press release I know they bought a Titan back around the middle of this year. They had a relationship with HP, found via Google in an HP success story, which leads me to believe they are still using HP systems. Specifically, based on data from /proc, I think they are using HP ProLiant DL360 2.00GHz G5 servers. Interestingly there are 4 Xeon 2GHz cores and only 2GB of memory per system. There is no local storage; instead the systems boot a root filesystem via NFS, and user storage is also mounted over NFS.
The Grid magic is this: store all user data on NFS so that no matter which system you connect to you can access the data. Then spread your vhost configuration to all hosts in the "grid", so that any system can serve your data. This system is therefore highly scalable, because adding an additional node to the "grid" is trivial, and reliable, because if one system dies, big deal. But this means that you require two things to make it work: really good load balancers and really good NFS storage. And by good I mean very reliable and extremely fast.
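To make the idea concrete, here is a toy sketch in shell of why a dead node is "no big deal" once every node mounts the same NFS data and carries the same vhost config. It is not how (mt) or any real load balancer does it; the node names, the DOWN list and the pick_node helper are all made up for illustration.

```shell
#!/bin/sh
# Toy model only: hypothetical node names, not MediaTemple's real hosts.
NODES="node01 node02 node03"
DOWN="node02"    # pretend this one just died

# A node is "up" unless it appears in the DOWN list.
is_up() {
  case " $DOWN " in
    *" $1 "*) return 1 ;;
    *)        return 0 ;;
  esac
}

# Round-robin on the request number, skipping dead nodes.
# Any survivor will do, because they all see the same NFS data and vhosts.
pick_node() {
  req=$1
  count=0
  for n in $NODES; do count=$((count + 1)); done
  [ "$count" -gt 0 ] || return 1
  tries=0
  while [ "$tries" -lt "$count" ]; do
    idx=$(( (req + tries) % count ))
    i=0
    for n in $NODES; do
      if [ "$i" -eq "$idx" ] && is_up "$n"; then
        echo "$n"
        return 0
      fi
      i=$((i + 1))
    done
    tries=$((tries + 1))
  done
  return 1    # every node is down
}

for r in 0 1 2 3; do
  echo "request $r -> $(pick_node "$r")"
done
```

The point of the sketch is that pick_node never has to care which customer a request belongs to. That only holds because the storage and the vhost config are identical everywhere, which is exactly why the NFS box has to be so good.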
And that's where BlueArc Titan fits into this story... without the performance offered by BlueArc Titan the Grid Server concept just can't work and becomes a disaster. Putting all user data on the Titan is a big vote of confidence, but putting all the root filesystems on it says something even more telling. No doubt the idea of putting root filesystems on NFS was not to reduce components in the servers but to facilitate provisioning and change management by means of cloning a "golden root" and rebooting each machine.
I have no idea what load balancer they are using. Apparently whoever it is isn't putting their name in a press release. In a setup like this I'd only choose to go with F5 BigIP, but who knows. They do have Pound installed on each node but I can't imagine that they'd spend money on systems and storage but not on load balancers.
Of course, this leaves one problem, especially if you're a Ruby on Rails developer: PHP can be served from any host by Apache, but Rails apps use their own webservers (WEBrick, Mongrel, or lighttpd). That's where the (mt) Containers come in. I'm less sure about how that works, and frankly less interested. Basically you create a little container (64M in the low end account) within which you set up your Mongrels, and that then starts the binaries on n grid nodes. This applies to any application that requires running binaries, so Java developers aren't welcome (until they design Tomcat/Geronimo containers). If you're a developer, look before you leap: (gs) might be great for static content and Apache CGI, but otherwise look elsewhere. These are by no means to be confused with real containers or what many call "Virtual Private Servers" (VPS) or even "Virtual Dedicated Servers".
Back to BlueArc, the real story here. I'm impressed that (mt) trusted their solution to them. It's a testament to the reputation BlueArc is building in the industry. I am a little interested in the configuration in terms of performance, because I found that with 8K blocks I get 102MB/s in a TextDrive Container (NFS on Thumper) vs 72MB/s in a MediaTemple Grid Server (NFS on Titan). I was shocked, actually; I would expect the BlueArc to blow away Thumper, but I'm withholding judgement for now. What I'll be watching is how the performance changes over time, as (mt) moves more customers (new and old) into the "grid". If I do a benchmark in 6 months will I see the same performance or reduced? When there is maintenance or failure on the Titan (unlikely as that might be) will it take down the entire site? It shouldn't of course, but that depends on whether (mt) bought a redundant configuration. In short, the fate of (mt) rests squarely on that device... let's see how things go.
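For anyone who wants to repeat that kind of comparison, a sequential write test with 8K blocks can be sketched with dd as below. This is a minimal sketch and not necessarily the exact test behind the numbers above. TARGET_DIR and BLOCKS are hypothetical knobs, and the default size is only a smoke test.

```shell
#!/bin/sh
# Sketch of a sequential write test with 8K blocks.
# TARGET_DIR and BLOCKS are hypothetical knobs: point TARGET_DIR at the
# NFS mount under test and raise BLOCKS well past the size of any cache.
TARGET_DIR="${TARGET_DIR:-${TMPDIR:-/tmp}}"
BLOCKS="${BLOCKS:-1024}"               # 1024 x 8K = 8 MiB, a smoke test only
TESTFILE="$TARGET_DIR/ddtest.$$"

# date +%s (epoch seconds) is assumed to be available on this system.
START=$(date +%s)
dd if=/dev/zero of="$TESTFILE" bs=8k count="$BLOCKS" 2>/dev/null
END=$(date +%s)

BYTES=$(wc -c < "$TESTFILE" | tr -d ' ')
ELAPSED=$((END - START))
[ "$ELAPSED" -gt 0 ] || ELAPSED=1      # avoid dividing by zero on tiny runs
echo "wrote $BYTES bytes in ~${ELAPSED}s: $((BYTES / ELAPSED / 1048576)) MB/s"
rm -f "$TESTFILE"
```

Keep in mind that zeros from /dev/zero and client-side caching can flatter the result, so the same block size, file size and mount options should be used on both sides of any comparison.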
Anandtech & Enterprise Storage
04:02 by benr
I haven't checked Anandtech in months, but felt the desire tonight. Frankly, it's the best hardware review site there is, and Anand Lal Shimpi's book The AnandTech Guide to PC Gaming Hardware is quite simply the best book ever written about x86 hardware, bar none, suffering only because its title massively misrepresents the breadth of content.
Regardless, I felt the need to point everyone toward an excellent article written by Johan De Gelas: Server Guide part 2: Affordable and Manageable Storage. Being Anandtech, it's vendor neutral and focuses on the disk technology and not so much on how it's put together into an enterprise package. This is a must read for anyone interested in storage!
Pillar Data Desperate for News?
20 Oct '06 - 21:09 by benr
I couldn't help coming across the following article, featured prominently on Pillar Data's front page: NetApp Unseated at NASA Unit. The article is from August but Pillar is still proud of it.
I'll save you the hassle of reading it... basically: NASA's Solar Data Analysis Center bought a NetApp F840 in 2002. That was toward the end of the F840's life, so it wasn't hot hardware even at the time; they must have bought just prior to the release of the FAS900s. Regardless, they bought it in 2002 and have been happy with it. Now it's getting old (really really old) and so they bought a Pillar Axiom. At first they were mirroring the F840 to the Axiom but now, shock of shocks, they flipped and the F840 is now backing up the Axiom! Talk about hot news!
So how is this news? Why did ComputerWorld write this crap? I mean, I'm migrating data from a CX300 to Thumper (Sun Fire X4500) and decommissioning the EMC... is that news? Should I call ComputerWorld with the exciting announcement?
Now if, for instance, NASA took the F840 into a field and filled it full of C4 so that it didn't stink up their datacenter, that would be something noteworthy. But they are still using it as a backup solution, which tells me that they like it and are getting a decent ROI.
It's no secret in the storage market right now that Pillar Data is trying to increase its client base so hard that they'll basically give you the thing for free if you haggle enough. They need to build a client base to show that they are viable, and with Larry's cash they'll do anything needed to make this brick fly. Frankly, Pillar is a company that just doesn't excite me. They call ancient performance tuning and allocation techniques 'revolutionary tiered storage' and are building a boring FC only device. They probably have more promise than EMC, the bloated and boring beast who's more interested in virtualization than storage, but they just don't bring anything terribly exciting to bear. Perhaps I'd feel different if I had some hands on time with one. Maybe it's a better solution than I think, and I'm accepting that I might be wrong... but I don't see Pillar winning on its merits. I don't see people saying much good about it, just that they are cheaper, and they aren't cheaper because of list price but because they will undercut anyone that gets in their way, and that sort of "used car salesman" tactic just doesn't fly with me.
Pillar benchmarks suck. Nothing exciting is out there about them. They whore themselves whenever possible. They've got great marketing, I love their ads, very attractive, but little else. In a sense I wish they'd just hurry up and die.
Like I said, though, I could be wrong. Maybe it's a great product and I'm just not seeing things straight. If so, speak up, they sure could use an advocate.
Blackbox is Real
17 Oct '06 - 21:15 by benr
After I learned of Project Blackbox I got excited... real excited. Anyone who has been around Sun in the Bay Area will recognize in the video the building behind it: the Executive Briefing Center on the Menlo Park Campus. I live across the bridge so I kicked my Volvo S70 T5's turbo into action and flew across the bridge. And I found what I sought....
It is really real. On the net the thing looks like a gag... a complex and complete one, but still an "Is this April?" check is called for. Clearly Sun planned this out ahead of time quite well; there were tons of poster-sized pictures around the Briefing Center and even little scale models of it (gotta get me one of those).
Sadly, it is cramped in there and so many customers were being toured through it that I only got to walk around it.... but visible from the door was this:
Force10 Networks... hell ya. Force10 is a company that I'm watching very closely. I'm very interested in deploying these babies for my high speed iSCSI data fabric, and seeing several of them (2 32-port switches were below this chassis) installed was a very pleasant surprise.
One last look at the hookups...
See? Just disconnect your washer and dryer and place your orders now!
On a more serious note, I'm surprised by the Wall Street reaction. Customers seem to be excited (from the buzz I heard around the demo onsite today) and people are lining up to get a look at it. This is one very serious show of engineering force. Has the idea been thought of before? Yes. But did anyone else dare to do it? No.
I'm reminded of an interview that Mike Judge did with Jay Leno about 'Beavis & Butthead'. Jay had him do the voices and said "What? I could have done that!" to which Mike quickly snapped "But you didn't! I did!" A classic line. I'm sure HP and IBM and others will downplay this offering, but it says something powerful about Sun: they aren't just talking, they're doing. Opening Solaris happened, putting a datacenter in a box happened, making the global network the computer happened, and building a platform that ran anywhere for everyone happened.
And so then why, might I ask, did Wall Street give Sun a slap? I can venture several guesses, but I don't like any of them. Whether this product takes off or just becomes a page in the history book, one thing is clear: it's a defining moment for the industry and for Sun, and one that everyone should be very, very proud of.
Joyent & Sun: The Movie
01:36 by benr
Hi... remember me? I used to blog and stuff. But like, now, I have to actually work for a living. It was nice at Homestead, my former place of employment, because I'd been there long enough to have automated and built almost everything. I did like maybe 3 hours of real work a day. It was sweet, and I blogged a lot. But now I'm a new guy with lots of stuff to design and build, so no sleep for me.
Wanna see the guys I work for? Sun recorded a nifty video clip a couple months ago (before I came on staff). You can see "the office" in San Anselmo and some of our servers in San Diego. I work at home and only venture north (about 60 miles or so) once a week.
Click on the image to see the flick. It's a nice spot.
While I'm crazy busy, it's really kool to be working full time on putting OpenSolaris to work. In a sense OpenSolaris evangelism is now a full time job, showing both my employer and our customers just how kick ass OpenSolaris is and what it can offer them, while helping drive the project in a hands on way as a contributor. It's a lot of work but I'm proud of what we're building.
Thump Thump
10 Oct '06 - 05:18 by benr
[thumper3:/splash] root# time dd if=/dev/zero of=testfile bs=1024k &
[thumper3:/splash] root# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
splash      11.2G  20.0T     96    289  11.3M  34.6M
splash      11.2G  20.0T      0  4.15K      0   531M
splash      11.2G  20.0T      0  3.98K      0   509M
splash      11.2G  20.0T      0  3.99K      0   511M
splash      18.2G  20.0T      0  1.56K      0   101M
splash      18.2G  20.0T      0      0      0      0
splash      18.2G  20.0T      0      0      0      0
splash      18.2G  20.0T      0      0      0      0
splash      18.2G  20.0T      0    727      0  90.9M
splash      18.2G  20.0T      0  3.42K      0   438M
splash      18.2G  20.0T      0  4.21K      0   539M
splash      18.2G  20.0T      0  4.19K      0   536M
splash      18.2G  20.0T      0  4.00K      0   511M
splash      18.2G  20.0T      0  3.71K      0   475M
splash      18.2G  20.0T      0  3.83K      0   490M
splash      18.2G  20.0T      0  3.86K      0   494M
splash      18.2G  20.0T      0  3.78K      0   484M
splash      18.2G  20.0T      0  3.74K      0   479M
splash      18.2G  20.0T      0  3.72K      0   477M
splash      18.2G  20.0T      0  4.20K      0   537M
splash      25.1G  20.0T      0  2.05K      0   164M
^C
Half a gig a second? Not bad. Too bad I still can't get the iSCSI Target to push a remotely decent amount of throughput, not to mention the memory suckage:
[thumper3:/] root# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   458 root      100G 7112M cpu2     0    0   0:01:01  22% iscsitgtd/18
...
...back to work.
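To put a number on the zpool iostat run above rather than eyeballing it, the write-bandwidth column can be averaged with a little awk. This is a rough sketch: avg_write_mb is a made-up helper name, it assumes the write bandwidth is the last column, and it skips the first row because that one reports the average since boot rather than a live sample. It prints two figures: the mean over all samples (stalls included) and the mean over the samples where something was actually written.

```shell
#!/bin/sh
# Sketch: average the write-bandwidth column of `zpool iostat <interval>`.
# avg_write_mb is a made-up helper name; it reads the output on stdin.
avg_write_mb() {
  awk -v pool="$1" '
    # Turn 531M / 4.15K / 1.2G / plain bytes into MB.
    function to_mb(v,    num, unit) {
      num = v + 0
      unit = substr(v, length(v), 1)
      if (unit == "K") return num / 1024
      if (unit == "M") return num
      if (unit == "G") return num * 1024
      return num / 1048576
    }
    $1 == pool {
      # The first row is the average since boot, not a live sample.
      if (!skipped) { skipped = 1; next }
      mb = to_mb($NF)
      total += mb; samples++
      if (mb > 0) { busy += mb; busy_samples++ }
    }
    END {
      if (samples == 0) { print "no samples"; exit 1 }
      printf "%.1f %.1f\n", total / samples, (busy_samples ? busy / busy_samples : 0)
    }
  '
}

# Demo using the first few rows of the run above.
avg_write_mb splash <<'EOF'
splash 11.2G 20.0T 96 289 11.3M 34.6M
splash 11.2G 20.0T 0 4.15K 0 531M
splash 11.2G 20.0T 0 3.98K 0 509M
splash 11.2G 20.0T 0 3.99K 0 511M
splash 18.2G 20.0T 0 1.56K 0 101M
splash 18.2G 20.0T 0 0 0 0
EOF
```

On a live system something like `zpool iostat splash 1 30 | avg_write_mb splash` should do the same for thirty one-second samples. The gap between the two figures is a quick way to see how much the periodic stalls cost in sustained throughput.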
Solaris 10 Update 3: No iSCSI Target For You!
08 Oct '06 - 06:29 by benr
In a strange turn of events, I logged into the Sun Beta site to start downloading the S10U3 Beta DVD ISO to my Joyent Jumpstart server so that I could load it onto my test Thumper and I see this message:
The inclusion of iSCSI Target Disk Support in the [thingy] was an error. This feature is not in S10 11/06 Beta and will not be in S10 11/06 RR.
This is a massive bummer. Currently Joyent is running on OpenSolaris B43. I was seriously considering going back to Solaris, since everything we need would be in a properly supported release (iSCSI Target, RAIDZ2, Zone Cloning, etc)... but leaving out the iSCSI Target is a deal breaker. So, I guess we'll be on OpenSolaris dev builds for a while longer.
ZFS within a Zone: Using Datasets
01 Oct '06 - 04:25 by benr
Around B43 Solaris Zones were given a new configuration attribute: dataset. This allows us to provide ZFS within a zone itself.
Before I continue, I think we should talk about ZFS terminology for a second. When I first started out with ZFS I found this idea of nested filesystems a bit odd. I remember watching the Flash demos created by Dan Price and wondering why he kept creating filesystems within filesystems. When I see /storage/users/benr I think directories, not nested filesystems. But over time, as I've used ZFS more and more, I've learned to adapt my thinking and now see the power and flexibility provided. Instead of thinking of nested filesystems we can think of datasets within a pool which provide end-point filesystems. Of course to really see how this works you should sit down and look at the beauty of ZFS's design (this is a good starting point, or listen to the designers describe it in this video). And so for now, a pool is the base structure that ties layout to disk, datasets are abstractions that act as a root for other filesystems, and filesystems are what you typically think of for storing data.
Back to the joys of ZFS... If you're already using ZFS and Zones you have everything you need, you just need to connect the two. Let's see how.
Let's first take a look at the zone we're going to modify using zonecfg's info option:
root@aeon ~$ zonecfg -z playzone001 info
zonename: playzone001
zonepath: /ultrastor/playzone001
autoboot: true
bootargs:
pool:
limitpriv:
inherit-pkg-dir:
        dir: /lib
inherit-pkg-dir:
        dir: /platform
inherit-pkg-dir:
        dir: /sbin
inherit-pkg-dir:
        dir: /usr
net:
        address: 10.0.0.24
        physical: nge0
As you can see, nothing special or interesting about it.
Now let's create a ZFS dataset, a filesystem that will host future filesystems within the zone:
root@aeon ~$ zfs create ultrastor/playzone001_ds
That was simple. Now let's give that dataset to the zone to use. We'll use zonecfg to add a "dataset" resource and set its "name" property, which points to the dataset we just created. After making the config change we'll need to reboot the zone to make it take effect.
root@aeon ~$ zonecfg -z playzone001 'add dataset; set name="ultrastor/playzone001_ds"; end; verify; commit'
root@aeon ~$ zoneadm -z playzone001 reboot
Done! Now let's have a looksie...
root@aeon ~$ zlogin playzone001
[Connected to zone 'playzone001' pts/15]
Last login: Sat Sep 30 02:07:13 on pts/21
Sun Microsystems Inc. SunOS 5.11 snv_47 October 2007
# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
ultrastor                  251G   115G  66.3G  /ultrastor
ultrastor/playzone001_ds  24.5K   115G  24.5K  /ultrastor/playzone001_ds
# df -h
Filesystem             size   used  avail capacity  Mounted on
...
ultrastor/playzone001_ds
                       366G    24K   115G     1%    /ultrastor/playzone001_ds
Notice that it's already mounted, but the mountpoint is ugly. Let's fix that and create a new filesystem.
# zfs set mountpoint=/zfs ultrastor/playzone001_ds
# df -h
Filesystem             size   used  avail capacity  Mounted on
...
ultrastor/playzone001_ds
                       366G    24K   115G     1%    /zfs
# zfs create ultrastor/playzone001_ds/web
# zfs set mountpoint=/web ultrastor/playzone001_ds/web
# df -h /web
Filesystem             size   used  avail capacity  Mounted on
ultrastor/playzone001_ds/web
                       366G    24K   115G     1%    /web
# zfs snapshot ultrastor/playzone001_ds/web@snap1
# zfs list
NAME                                 USED  AVAIL  REFER  MOUNTPOINT
ultrastor                            251G   115G  66.3G  /ultrastor
ultrastor/playzone001_ds            71.5K   115G  24.5K  /zfs
ultrastor/playzone001_ds/web        24.5K   115G  24.5K  /web
ultrastor/playzone001_ds/web@snap1      0      -  24.5K  -
As you can see, using datasets within a zone is easy to do and adds a lot of power to your container deployments.
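If you do this for more than one zone, the global-zone half of the walkthrough can be collected into a small script. The sketch below is only that, a sketch: ZONE and DATASET default to the names used above, while DRY_RUN, run and delegate_dataset are hypothetical names of my own. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: delegate a ZFS dataset to an existing zone.
# ZONE and DATASET default to the names used in the walkthrough;
# DRY_RUN is a hypothetical safety knob (1 = just print the commands).
ZONE="${ZONE:-playzone001}"
DATASET="${DATASET:-ultrastor/playzone001_ds}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

delegate_dataset() {
  # 1. Create the dataset in the global zone.
  run zfs create "$DATASET" || return 1
  # 2. Hand it to the zone as a "dataset" resource.
  run zonecfg -z "$ZONE" "add dataset; set name=$DATASET; end; verify; commit" || return 1
  # 3. Reboot the zone so the change takes effect.
  run zoneadm -z "$ZONE" reboot
}

delegate_dataset
```

Run it with DRY_RUN=0 from the global zone once the printed commands look right. The mountpoint changes and snapshots from the walkthrough are then done from inside the zone, exactly as shown.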
It should be noted... functionality within a zone isn't perfect. For instance, the pool and dataset aren't hidden below the zone. It'd be nice ultimately to mask that somehow so you just saw "zfs" as the pool, for instance. More noteworthy, however, zfs create -V (size) won't work within the zone: it gets into a weird state where the filesystem is created but the size ignored, and then you can't destroy it from within the zone, so you have to go back out to the global zone and fix it there. But that's pretty minor.
Solaris 10 Update 3 Beta Opens
04:05 by benr
The Solaris 10 Update 3 beta is opening up this week. Notices for the program went out last week and applications are being reviewed now. By the end of the week it is expected that people will start installing and testing it. The current release is labeled "Solaris 10 (11/06)" so we're hoping for a November release. There was a lot of work poured into B49 of Nevada which just closed a week ago, and perhaps this is why. The current slated feature list includes (but is not limited to):
- Solaris Trusted Extensions
- APOC 1.2
- Enhanced Security for Limited Networking Profile
- Macromedia Flash Player v 7 for Solaris
- iSCSI Target Disk Support w00t!!!!
- Packet Filtering Hooks for IP Filter
- Network Layer 7 Cache
- Zone Cloning, Moving and Migration
- Configurable Zone Privileges
- ZFS file system enhancements
- Java Desktop System support for Access Control Lists in Nautilus
- Upgrade to Japanese IM Language Engine to Wnn8
- Sun Java System Message Queue Platform Edition 3.7
- Sun Java System Application Server Platform Edition 8.2
- SNIA Multipath Management API support
I would normally hesitate to "leak" a list of forthcoming features like this, but thanks to OpenSolaris everything you see above is something we already have and in many cases are running now. In the case of Joyent, my employer, we're running most of the things on that list in production now.
What's most exciting is that for those of us who are running these things in production right now this will give us a chance to go back to a proper fully-supported Solaris release, at least for a while. Good good things are a comin'.
Trusted Extensions
03:54 by benr
Documentation has just been released for Solaris Trusted Extensions; find the docs here at docs.sun.com.
I had the great joy of seeing a presentation on Trusted Extensions at the Silicon Valley OpenSolaris Users Group (SVOSUG) meeting this week where Glenn Faden presented (his slides are here, see the comments for a PDF version). While a lot of the goodies created for Trusted Solaris have been with us for a long time, it was interesting to see it in its full glory during the presentation. Perhaps most interesting was that Trusted Extensions was integrated into OpenSolaris a while ago, in NV B42a (Solaris Express 7/06). What you might want to know is that while everything Trusted Extensions needs is already on your system, you need software from the "Extra Value" directory during the install to actually activate it all.
Trusted really takes security to the next level. While RBAC and Solaris Priv's are kool for just about any task, Trusted goes much further, much much further, Area51 further. When I was watching the presentation I was struck by the fact that most security is about keeping the bad people outside of your organization from getting in... but Trusted Extensions goes so much further, to the point that you're protecting yourself against your own legitimate users. My favorite feature is cut-and-paste security, ensuring that if someone has clearance for my uber-top-secret document they still can't cut-and-paste lines from it. Uber sweet.
So if you're one of the many people asking "What about Trusted Solaris?", stop asking and start playing. Download the latest SX:CR ISOs and enjoy all the hardcore security goodness. Frankly, this is perhaps one of the very few features of Solaris that I can't see using myself but certainly have great appreciation and respect for.
