Archive for July, 2008

AWS seems down!

It seems the most important of Amazon’s web services are down. EC2 (the Elastic Compute Cloud), S3 (storage) and SQS (the queueing service) are all reporting red on CloudStatus. I hope they come back up soon, and I’m very curious about the cause.

AWS down

Goodwood Festival of Speed 2008

Last week my good friend Quintijn (with whom I also did Lasvegas2Miami) and I went to the famous Goodwood Festival of Speed. A few things we learned: the city of Brighton sucks a lot (it’s old, everything is broken and the hotels are the worst in the world), old cars can be really fast too (we saw 80+ year old cars with over 200 hp doing extreme speeds), drag cars are really noisy, Brighton sucks a lot (did I say this already??) and the BMW 650i Cabrio, which was provided by my BMW dealer Hans Severs, drove excellently!
The video below shows it all. Quintijn took a lot of pictures; you can find them on my Flickr account.

Update: I’m putting up a new version of the video. This one doesn’t look too good and has trouble running smoothly, probably because of the high-resolution original. I’m uploading a downscaled one right now.

One thing is very important to mention, since I don’t want other people to have this same problem. THE HILTON HOTEL IN BRIGHTON IS THE WORST HOTEL IN THE WORLD. Everything is dirty and the rooms are extremely small. I’m not kidding: I’ve been to a lot of hotels and hostels, and even the worst hostels in the smallest and dirtiest villages in Australia didn’t come close to this hotel. The worst part is still to come: we had to pay 130 pounds per night per room!! Don’t EVER go there; not only will you be very disappointed, you’ll also be ripped off. To make my point I included some pictures of my room below.

photo4

photo3

photo2

photo1

Website migration SEO and Performance Checklist

Everybody has seen it happen: just after a new site (or a new version of your site) goes live, the whole thing has to be rolled back or quick fixes need to be applied. The cause is usually a drop in visitors (somehow your ranking dropped?) or a new version that won’t perform at all. There is a way to prevent all this! This checklist, made by Joost de Valk (Onetomarket) and Eelco van Beek (IC&S), is meant to help companies and individuals that are planning to launch a new site.

Redirect old URLs
In most cases a new website comes with new URLs. This should not be a problem, but since a lot of people, weblogs and search engines have linked to the old URLs, a lot of the content won’t be found anymore. All those old URLs should therefore be 301 (permanently) redirected to the new URLs. If you don’t do this, all the value built up in the old URLs will be lost. There are a lot of examples of websites that lost 70-80% of their traffic due to missing or wrong redirects after a migration.
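A minimal sketch of how such a redirect table could be checked before going live. The paths in REDIRECTS and the function names are made up for illustration; a real site would generate the table from its old and new URL structure and let the webserver or application do the actual redirecting.

```python
# Hypothetical mapping of old paths to new paths (illustration only).
REDIRECTS = {
    "/products.php?id=12": "/products/blue-widget/",
    "/about.html": "/about/",
}


def resolve(path_and_query):
    """Return (status, location) for a requested path.

    Old URLs answer with 301 (moved permanently) and the new location,
    everything else is served as-is with 200.
    """
    target = REDIRECTS.get(path_and_query)
    if target is not None:
        return 301, target
    return 200, path_and_query


def unmapped(old_urls):
    """Return the old URLs that do not have a redirect yet."""
    return [url for url in old_urls if url not in REDIRECTS]
```

Feeding unmapped() the list of URLs from your old sitemap or access logs gives a quick to-do list of pages that would otherwise disappear.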

Rankings on important keywords
A new site usually means a new style and new texts, and in some cases even new product names. By changing texts and renaming pages you could easily lose your rankings in the search engines on terms that brought in very relevant traffic. Always check your analytics before you change your product names and make sure that you’re not losing any rankings.

Hosting in the right country
It is becoming more and more important for search engines that a domain is hosted in the country it is meant for. This also delivers the best experience for the users. So don’t move your hosting to another country based on price when migrating to a new site. What seems inexpensive might turn out to be very expensive.

Don’t change your whois
When your site and hosting are being changed completely, it is not a good idea to also update the whois (the ownership information) of your domain. Search engines have the awkward habit of marking a site on which everything changes at once as sold. When that happens, search engines feel the need to rebuild your site’s rankings completely, deleting all of the former ranking information. This is something you really want to prevent.

Measuring is making sure
It’s essential to measure site capacity before going live. There are a lot of tools (like JMeter) that can simulate users on your site. By running a lot of those simulations at once while monitoring the platform and measuring all kinds of loads, you can determine the maximum number of users the site can handle.
This information can be used to set up a scaling plan, so you can scale up when 70 or 80% of the maximum capacity is reached. Watch out! Often, mostly on big sites, the traffic generated by search engine spider bots is neglected in this capacity measurement. When deploying a new site version this traffic can reach thousands of spidered pages per hour. To stay in control of this process it is generally a good idea to (A) include this traffic in the capacity testing and (B) spread the load of the indexing search engines by setting a crawl delay.
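As a rough sketch of both ideas: the robots.txt below is an invented example with a Crawl-delay of 10 seconds (not every crawler honours this directive), and should_scale() is a hypothetical helper that flags when the measured number of users passes a chosen share of the maximum capacity.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content: ask crawlers to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) that applies to an arbitrary, made-up bot name.
delay = parser.crawl_delay("ExampleBot")

# Upper bound on pages per hour for one crawler that respects the delay.
max_pages_per_hour = 3600 // delay


def should_scale(current_users, max_users, threshold=0.7):
    """Return True once the load reaches the threshold share of max capacity."""
    return current_users >= threshold * max_users
```

The max_pages_per_hour figure can be added to the simulated user load in the capacity test, so spider traffic is no longer left out of the measurement.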

Are users visible in the site performance stats? Scale up!
It’s always a good idea to actively measure the user experience (in terms of effective speed) of your site. This way critical delays (DNS resolving, bandwidth shortage, network latency, connection count or HTML) can be identified. For example: a certain site shows a different response time on working days than on weekends. This site is known to have more users on working days. The number of users is directly influencing the site’s response times, so it is time to scale up the site. As long as the maximum number of users has not been reached, the number of users should never influence the response times of a site. An example of such a performance measurement can be found here (we do these for our customers at IC&S).

There is always something cacheable
Caching is storing (pieces of) websites or pages in an intermediate infrastructure. When a piece of content that has passed through the cache is requested a second time, and the content is known not to have changed in the meantime, the cache returns it instead of the actual application layer. This offloads the application layer, which results in more capacity. Cacheable items can be found in all websites and web applications. This also goes for dynamic content which is generated specifically for a user. Take a forum, for example. A forum is static until messages or comments are added, changed or deleted. When that happens, the cache can be instructed to delete that specific piece of content so that it will be reloaded the next time it is requested. Caches can be instructed to cache all kinds of data, for example based on the content type (images, video, static texts) or certain headers. For most of the platforms we support we cache about 70-90% of all content!
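The forum example can be sketched in a few lines. This is only an in-memory illustration of the idea (serve from the cache, throw away one entry when its content changes); PageCache and its method names are made up here and do not refer to any particular caching product.

```python
class PageCache:
    """Tiny in-memory cache in front of a (slow) render function."""

    def __init__(self, render):
        self._render = render   # function that builds a page from scratch
        self._store = {}        # key -> cached page
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached page, rendering it only on a cache miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        page = self._render(key)
        self._store[key] = page
        return page

    def invalidate(self, key):
        """Drop one entry, e.g. after a comment was added to that thread."""
        self._store.pop(key, None)
```

When a message is posted to a thread, the application calls invalidate() for that thread only; every other thread keeps being served from the cache.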

Put your servers as close as possible to your users
In the Netherlands there’s a popular internet exchange point called the AMS-IX. This exchange connects internet content providers and access providers (the users) to put them (in network terms) as close to each other as possible. The golden rule is: the shorter the path between the supplier and the receiver of data (measured by the actual distance and the number of nodes in between), the faster the actual transfer. Sites in the Netherlands whose users are also in the Netherlands should therefore use the AMS-IX. Global sites should think of using CDNs, Content Delivery Networks. A Content Delivery Network is located in many countries and is able to deliver your content at a much higher speed. It does this by logically chopping up your site into cacheable or locally processable parts and putting those on its edges (servers placed in different countries). A CDN also does not use the regular internet infrastructure to connect to these edges, but instead uses a direct (faster) network (its own backbone).

There is always something cacheable, also on the backend
Many websites use centralized data storage, for example a database. Nowadays almost all databases support query caching: when the same query is executed a couple of times and the dataset hasn’t changed, the query won’t really be executed. Instead the result set is retrieved from a cache. This speeds things up a lot: the database engine gets fewer queries and information is retrieved blazingly fast. When using a database, another thing to take care of is connection pooling between the database and the front end (PHP, ASP, Java etc.). Creating a connection is a very expensive process (in terms of load and time). Connections should be re-used through a connection pool.
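A minimal sketch of the pooling idea, using sqlite3 only because it ships with Python. ConnectionPool is a made-up class for illustration; in practice you would normally use the pool that comes with your database driver or framework.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Open a fixed number of connections once and hand them out on demand."""

    def __init__(self, database, size=2):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets different threads borrow a connection.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        """Borrow a connection; it goes back into the pool afterwards."""
        conn = self._idle.get()   # blocks while all connections are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)
```

Every request borrows a connection with `with pool.connection() as conn:` instead of opening a new one, so the expensive connect step only happens when the pool is created.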

HTML or HMTL
The order of your HTML source code is important. Is JavaScript used in the site? Be aware that a script blocks the loading of the other components in the HTML until the script itself has been loaded. Tools like Pagetest, and also YSlow for Firebug, provide (among a lot of other interesting information) detailed information about page loading and blocking. The blocking issue can be handled by using certain arguments. The problem, though, is that these arguments are browser specific, so you might need to implement browser-specific JavaScript.

Video? Not too fast
A video should be played at a certain number of bits per second. This is called the bitrate. If the video is not delivered at this speed it will break up and won’t show correctly. The bitrate depends on the format, codec and quality of the video. Make sure that when a video is requested the server returns it at a speed a little above the video bitrate. Otherwise the video will be downloaded at the full speed of the requester (which with DSL is already 20 mbit/sec). An example: we’ve got a server which provides videos. The server is connected with a 100 mbit/sec connection. This means that 10 users with a 10 mbit/sec DSL connection fill up the connection completely. The video, however, has a bitrate of 256 kbit/sec. By sending the video at 350 kbit/sec we can serve about 290 users instead of the 10 just mentioned. Be aware: these users will use up the connection for a longer period.
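The arithmetic behind this example, as a small sketch. It assumes decimal units (1 mbit = 1000 kbit), which gives 285 users; counting 1 mbit as 1024 kbit gives 292, so "about 290" holds either way. The function names and the 350-second clip are only there for illustration.

```python
def concurrent_streams(link_kbit, per_user_kbit):
    """How many users fit on the link if each one gets per_user_kbit."""
    return link_kbit // per_user_kbit


def transfer_seconds(play_seconds, bitrate_kbit, send_kbit):
    """How long a user occupies the link for a video of play_seconds."""
    return play_seconds * bitrate_kbit / send_kbit


LINK_KBIT = 100 * 1000                                    # 100 mbit/sec server uplink

unthrottled = concurrent_streams(LINK_KBIT, 10 * 1000)    # every user pulls 10 mbit/sec
throttled = concurrent_streams(LINK_KBIT, 350)            # every user is capped at 350 kbit/sec

# A 350-second clip encoded at 256 kbit/sec:
fast_download = transfer_seconds(350, 256, 10 * 1000)     # a few seconds at full DSL speed
capped_download = transfer_seconds(350, 256, 350)         # most of the playing time when capped
```

The last two numbers show the trade-off from the final sentence: capping the speed lets many more users in at once, but each of them keeps a connection open for most of the length of the video.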

Is the site cloud-ready?
The newest hype in internet infrastructure land is cloud computing. A cloud is a large number of connected computers on which virtual machines can be created on the fly. You pay for what you use (bandwidth and CPU time). The big advantage of this approach is that you don’t need to invest a lot of money in a complete infrastructure which, in capacity terms, has to be designed to handle the peak load of the platform. With proper configuration and application tuning a site can run on one virtual machine but, when traffic increases, spawn new instances of the same virtual machine to cope with the load (autoscaling). This is a very interesting development for internet sites which are not sure how popular they will be: investments are low and, with a good revenue model, you’ll earn more as your costs increase (because costs only increase when the number of users increases). Animoto is based upon such a platform and runs on the Amazon EC2 cloud.
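To make the peak-load argument concrete, here is a small sketch comparing a fixed setup sized for the peak with an autoscaled one. All numbers are invented: six hours of traffic and an assumed capacity of 500 users per virtual machine.

```python
import math

USERS_PER_INSTANCE = 500                             # assumed capacity of one virtual machine
hourly_users = [200, 150, 900, 2400, 1100, 300]      # invented traffic, one value per hour


def instances_needed(users, capacity=USERS_PER_INSTANCE):
    """Instances required for a given load, with a minimum of one."""
    return max(1, math.ceil(users / capacity))


per_hour = [instances_needed(u) for u in hourly_users]

# Autoscaling: pay per hour only for the instances that are actually running.
autoscaled_instance_hours = sum(per_hour)

# Fixed infrastructure: sized for the busiest hour, running the whole time.
fixed_instance_hours = max(per_hour) * len(hourly_users)
```

With these made-up numbers the fixed setup burns more than twice the instance-hours of the autoscaled one, and the gap grows the spikier the traffic is.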

A whale on the beach

So, this morning when walking up to the car to drive our son to the daycare center (we had to go to work) we bumped into our neighbour who, as he put it himself, was “on his way to check out this whale on the beach”. I live in Scheveningen, the coolest beach village in the Netherlands. Even though we live at a beach, it rarely happens that special animals wash ashore. In my opinion it’s always a sad event, but I was very curious about the whale itself. How big would it be? Would it be decomposing (you always hear those stories about whales blowing up because of the gas buildup inside)?

P1010928

So, we got there. There was a very fishy smell and the whale looked kind of real. Since we didn’t have anything to compare to (it was our first dead whale on the beach moment) we figured it was probably real.

Back at the office, Marloes started searching online and as it turned out, the whale was an art project by some Belgian guy who wanted to see how many people would think it was real. A little too morbid for my taste.

You can check out some more pictures at my flickr account.

Testdriving the corvette

I was browsing through my video sources yesterday and I found something I totally forgot to put online: our Corvette test drive! Enjoy!

Walking through Muir Woods National Monument, commando style!

The movie says it all!

Update: Arghhh.. why did they pick this frame as startframe?!!!

Transferring my digital life to Google

I should actually be ashamed.. a long, long time ago I designed and created my own e-mail system called DBmail. The general idea was that e-mail and e-mail provisioning are structured data, and structured data should be stored in a database. This way maintenance, searching, scalability, backup and data consistency should be better covered than when using a filesystem for storage (all depending on the database of course, but DBmail supports many). This, combined with the fact that I have easy access to my own online servers, should have resulted in me using DBmail for all my e-mail needs. Well, as of a couple of days ago, this is no longer the case.

A few months ago I talked my dad into transferring his e-mail (and domain) to Google by subscribing to Google Apps. He has seemed very happy since then (I’m measuring by the number of help requests my brothers and I are receiving). Since I was doing my own e-mail I was also doing my own spam fighting, which started to take a lot of my time. FYI, my domain eelco.com is receiving about 20,000 unwanted messages per day. My wife’s domain marloes.info gets about the same, so I need to block about 40,000 messages per day just for my wife and me. Then I’m also hosting e-mail for a couple of domains for friends, so add another 20,000 to that. If I’m not blocking those messages they (the friends and family) start telling me I’m letting too much spam through. An effective remedy is called greylisting. Greylisting uses a feature of the SMTP protocol (a temporary failure response) to request a resend when a message is delivered for the first time from a unique sender. If the sender is using a regular, fully SMTP-compliant mailserver, the mail will be resent. If the sender is a spammer, he probably uses a bulk sender which does not fully comply with SMTP and therefore will not send the message again. So the spam message is blocked. The problem with this approach is that a lot of messages that are sent for the first time will have an arbitrary delay.. which kind of sucks when you’re waiting for a certain subscription message to come through.
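For the curious, the core of greylisting fits in a few lines. This is a sketch of the general technique, not of any specific mail server: it assumes the common choice of keying on the (client IP, sender, recipient) triplet, uses a made-up minimum delay of 300 seconds, and answers with SMTP-style codes (451 for "try again later", 250 for "accepted").

```python
import time


class Greylist:
    """Temporarily reject the first delivery attempt of every unknown triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self._first_seen = {}        # triplet -> time of the first attempt
        self._min_delay = min_delay  # seconds a sender has to wait before retrying
        self._clock = clock          # injectable clock, handy for testing

    def check(self, client_ip, sender, recipient):
        """Return 451 (temporary failure) or 250 (accept) for this attempt."""
        key = (client_ip, sender, recipient)
        now = self._clock()
        first = self._first_seen.setdefault(key, now)
        if now - first < self._min_delay:
            return 451   # unknown or too quick: a real mail server will retry later
        return 250       # the sender came back after the delay: let it through
```

A compliant mailserver queues the message and retries after a while, at which point check() returns 250. A bulk sender that never retries simply never gets past the 451.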

The spam fighting is actually the biggest reason I switched to Google Apps for my e-mail. Google uses its huge Gmail userbase to identify spam; every time you click the report spam button the Gmail system learns about your spam message and prevents it from being delivered in the future, also for other Gmail users. This works great: I’m getting even less spam than on my own server.

Another really nice feature of the Google mail platform is the search feature. It’s extremely quick and uses Google search technology to index your e-mail messages. So with this feature in mind I wanted to import my complete digital life into the Google mail database. Google has a nice option which uses IMAP to import from other mail accounts. The problem is that my e-mail dates back to 1991 (still got all of it in backup) and I really wanted to be able to import that as well. First I tried a dirty hack by ‘resending’ my old e-mail to an intermediate account which I could then import using the Google IMAP import. That didn’t work and quite a few people received really strange bounces, sorry about that folks :-)

Then I discovered the Google E-mail API method for importing e-mail. This is an XML-based system which accepts standard RFC 822 messages in an XML envelope, accompanied by one or more labels, for inclusion into the Google e-mail database. Since a lot of my old e-mail is in the Cyrus DB or Maildir format I needed to create a little script that recurses through those storage formats, identifies e-mail messages and sends them through the API into the Google db. I wanted to freshen up (actually re-learn) my Python skills so I did the whole thing in a Python script.

You can download the script (GoogleMailPy) here. It’s quite dirty but it works and it supports the Google response messages (Google doesn’t like it when a lot of e-mail is being pushed within a certain timeframe. The script takes this into account and uses the Google Double Back strategy to Play Nice™). By default the script adds two labels: GoogleMailPy (to identify all mail processed by the script) and the directory path in which the original e-mail message was found (with Cyrus and Maildir this is also the folder path in your e-mail, so it kind of makes sense to use this as a temporary label until you find a better one).
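A sketch of what such a play-nice strategy can look like. This is not the code from GoogleMailPy: it assumes that backing off means doubling the waiting time after every rejected upload, and send, max_attempts and base_delay are hypothetical names and values chosen for the example.

```python
import time


def send_with_backoff(send, message, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(message) until it succeeds, doubling the wait after each failure.

    `send` is any function that returns True when the server accepted the
    message and False when it asked us to slow down. Returns the number of
    attempts that were needed.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
        if attempt < max_attempts:
            sleep(delay)   # wait before the next try...
            delay *= 2     # ...and wait twice as long the time after that
    raise RuntimeError("giving up after %d attempts" % max_attempts)
```

The sleep function is passed in as a parameter so the waiting can be replaced by a stub when trying the function out.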

GoogleMailPy uses no external Python libs so it should work out of the box. I’ve put some comments in the source in case you’d like to add a feature of your own. I’ve just imported about 32,000 messages and everything seems to be working OK. Using the script is of course totally at your own risk and it’s probably stuffed with bugs :-)

Usage is easy:

1. First set the right user credentials in the script. Check for the SETTING comments (there are three: one for turning certain labeling features on or off, one for the user credentials and one for the right userpath).
2. Just call the script with a Maildir or Cyrus DB directory path.
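To give an idea of the directory walk involved, here is a minimal sketch for the Maildir case only, using the mailbox module from the Python standard library. It is not taken from GoogleMailPy, the function name is made up, and the Cyrus format would need its own handling. The label is simply the folder path relative to the directory you pass in.

```python
import mailbox
import os


def iter_maildir_messages(root):
    """Yield (label, raw_message_bytes) for every message in Maildirs below root."""
    for dirpath, dirnames, _ in os.walk(root):
        # A Maildir is a directory that contains 'cur' and 'new' subdirectories.
        if {"cur", "new"} <= set(dirnames):
            box = mailbox.Maildir(dirpath, factory=None, create=False)
            label = os.path.relpath(dirpath, root)   # folder path used as label
            for key in box.iterkeys():
                yield label, box.get_bytes(key)
            box.close()
            # Do not descend into the Maildir's own bookkeeping directories.
            dirnames[:] = [d for d in dirnames if d not in ("cur", "new", "tmp")]
```

Each yielded pair is what an import script needs per message: the raw RFC 822 text and a label to attach to it.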

Please let me know about your experiences in the comments of this post.
I’m now checking out the other Google Apps (especially the word processor and spreadsheet). We might connect them to our new client portal at the office to generate and share reports. More on that later.