How the Blogosphere actually works

A lot of people know that I’m building a company that processes the blogosphere, and I get a LOT of questions about how the blogosphere works. In particular, I get asked a lot about how my company and what we do differ from Google. Since it’s such a common question, I thought I would explain how the blogosphere actually works, in lay terms. This post will also serve as the introductory/background post for several posts I want to write about the problems in the blogosphere and the need for changes in the way things work.

First, I need to cover briefly how Google and other web spidering systems work. Basically, Google, Yahoo, Ask, etc. use spider bots that download an HTML (web) page and process the entire page. They do a lot of things in that process, including identifying and logging all the links that are on that page. This allows them to discover yet more pages on the internet. By perpetually logging links and downloading the pages that they link to, they are able to continually discover new content on the internet. They do lots of other things with both the links and the content they find, but knowing this much will suffice for our purposes right now.
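For the code-minded, here is a minimal sketch of that "log the links, follow the links" loop in Python. It is only an illustration of the idea, not how any real search engine does it. The names `LinkCollector`, `extract_links` and `crawl` are made up for this example, and `fetch` is a stand-in function you supply that takes a URL and returns the page's HTML.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return every link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def crawl(start_url, fetch, max_pages=100):
    """Discover pages by following links, starting from `start_url`.

    `fetch` is a caller-supplied function (hypothetical here) that maps
    a URL to its HTML text. Returns the set of URLs discovered.
    """
    to_visit = deque([start_url])
    discovered = {start_url}
    visited = 0
    while to_visit and visited < max_pages:
        url = to_visit.popleft()
        visited += 1
        for link in extract_links(url, fetch(url)):
            if link not in discovered:
                discovered.add(link)
                to_visit.append(link)
    return discovered
```

Passing `fetch` in as a parameter keeps the sketch free of any network code, so you can try it against a plain dictionary of fake pages.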

The blogosphere is processed much the same way, but completely differently.

Whenever someone publishes a blog post, the blog sends out a notification called a Ping to a server on the internet called a Ping Server. A Ping simply tells the Ping Server that a new post was published on that blog. The Ping Server is a computer on the internet that is configured to just sit and wait for a Ping to come in. That’s pretty much it – as far as your particular blog is concerned. After the Ping has been sent you don’t have to do anything else.
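To make the Ping a little more concrete, here is a rough sketch in Python. It assumes the Ping is sent as an XML-RPC call named `weblogUpdates.ping` that carries the blog's name and URL, which is the convention many ping servers follow; this post doesn't depend on that detail, so treat it as an assumption. The function names and the example URLs are made up for illustration.

```python
import xmlrpc.client


def build_ping_request(blog_name, blog_url):
    """Build the XML body of a ping without sending anything.

    Assumes the XML-RPC `weblogUpdates.ping(name, url)` convention.
    """
    return xmlrpc.client.dumps(
        (blog_name, blog_url), methodname="weblogUpdates.ping"
    )


def send_ping(ping_server_url, blog_name, blog_url):
    """Send the ping to a ping server and return its reply.

    `ping_server_url` is whatever XML-RPC endpoint your ping server
    publishes; no real endpoint is assumed here.
    """
    server = xmlrpc.client.ServerProxy(ping_server_url)
    return server.weblogUpdates.ping(blog_name, blog_url)


if __name__ == "__main__":
    # Print the request body so you can see how small a ping really is.
    print(build_ping_request("My Blog", "http://example.com/"))
```

Notice that the Ping carries no post content at all. It only says "this blog changed", which is why everything else has to be fetched afterwards.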

The rest comes in on the other side of the Ping Server. Ping Servers not only listen for Pings, but they also provide a Ping Feed which is an XML file that lists all the Pings that have been received in the last X amount of time.

If you want to process the blogosphere, you have to continually download the latest updates from the Ping Servers in a file generally called a changes.xml file. The changes.xml file is an XML file that has a long list of Pings in it. Depending on the Ping Server, there can be upwards of 30,000 Pings in the file per hour. Once the changes.xml file is downloaded, you have to go through each Ping in the file and record it for processing.
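Going through a changes.xml file might look something like the sketch below. The layout in `SAMPLE_CHANGES` (a `weblogUpdates` root holding `weblog` elements with `name`, `url`, `when` and an optional `rssUrl` attribute) is an assumption modeled on the commonly seen format; your Ping Server's file may differ, so check it before relying on these names.

```python
import xml.etree.ElementTree as ET

# Assumed layout of a changes.xml file; real servers may differ.
SAMPLE_CHANGES = """
<weblogUpdates version="1" count="2">
  <weblog name="Example Blog" url="http://example.com/" when="12"/>
  <weblog name="Other Blog" url="http://other.example/"
          rssUrl="http://other.example/feed.xml" when="40"/>
</weblogUpdates>
"""


def parse_changes(xml_text):
    """Return one dict per Ping listed in a changes.xml document."""
    root = ET.fromstring(xml_text.strip())
    pings = []
    for weblog in root.iter("weblog"):
        pings.append(
            {
                "name": weblog.get("name"),
                "url": weblog.get("url"),
                # Not every Ping carries the feed URL; None if absent.
                "rss_url": weblog.get("rssUrl"),
            }
        )
    return pings


if __name__ == "__main__":
    for ping in parse_changes(SAMPLE_CHANGES):
        print(ping)
```

In a real system you would write each of these records to a queue or database instead of printing them, since the next step is to visit every one.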

Each Ping identifies the URL of the blog and/or the URL of the RSS file for the blog. Now that you have the Ping, you have to go to the blog and download its RSS feed. The feed can contain anywhere from 1 to 100 posts. If you’ve downloaded the RSS feed before, you now have to determine which posts in the feed are new and which ones you’ve downloaded before. Once you’ve figured that out, you can extract just the new posts from the RSS feed and store them, or do whatever it is you are planning to do with them (in our case, this is the point where we do pretty much the same thing as Google does).
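Here is one way the "which posts are new?" step could be sketched in Python. It assumes a plain RSS 2.0 feed where each `item` has a `guid` or at least a `link`, and it remembers what it has seen in an in-memory set. Both are simplifying assumptions: other feed formats use different element names, and a real system would keep the seen list in a database. The function name `new_posts` is made up for this example.

```python
import xml.etree.ElementTree as ET


def new_posts(rss_text, seen_ids):
    """Return the posts in a feed that are not already in `seen_ids`.

    `seen_ids` is a set of post identifiers from earlier downloads;
    it is updated in place so the next call skips these posts too.
    """
    root = ET.fromstring(rss_text.strip())
    fresh = []
    for item in root.iter("item"):
        # Prefer the guid; fall back to the link if there is none.
        post_id = item.findtext("guid") or item.findtext("link")
        if not post_id or post_id in seen_ids:
            continue
        seen_ids.add(post_id)
        fresh.append({"id": post_id, "title": item.findtext("title", default="")})
    return fresh
```

Everything in the feed that fails the `seen_ids` check was downloaded for nothing, which is exactly the waste this post is describing.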

That’s pretty much it. There are multiple Ping Servers on the internet, and different blogs Ping different Ping Servers. To find out about all the new blog posts, you’ve got to process multiple Ping Feeds at the same time in order to keep up. There are generally around 30 – 100 Pings per SECOND, which adds up to around 1.5 million Pings per day. Each RSS feed is around 5 – 100k, which adds up to about 1 – 3 gigabytes per day or several terabytes per month. Every time you get a Ping from a blog, you have to download its entire RSS feed (again) in order to get 1 post out of it – and then you throw away the rest of the file you just downloaded.

The Evil Spammers are out in force in the Ping-O-Sphere. About 50% – 80% of the Pings that come through are garbage, which you have to filter and throw away unless you want a ton of garbage in your post database.

But I’m beginning to leak into my next post about the architecture of the blogosphere. For now, that is pretty much how the blogosphere works and what it takes to be able to process it. In the future, I’m going to go into some of the architectural problems with the way the blogosphere has been put together and talk about some of the things that (I think) need to be done to make it more efficient.

Any questions?

Phil801 – Geek Blog
