Speech Recognition must-read books

Speech Theory & Fundamentals

Machine Learning

Signal Processing

Algorithms & General Computer Science

Natural Language Processing / Computational Linguistics

Speech UI Design

grep through gzip files vs tarball files

I recently came across a couple of instances, on random forums, where a user was trying to grep through a huge log file that had been tarred and then gzipped. Only then did I realize that some people out there do not know the differences between a gzip file and a tarball, or the advantages of each (actually, that particular dude didn’t know the purpose of a tarball). So, for my personal benefit, I ran a couple of experiments on some big log files (413 megabytes of text) to see what the speed vs storage trade-offs were when using gzip and tar.gz, all with bash (sorry tcsh users!).

A little on gzip files

A gzip file is simply a file that has been compressed with an algorithm called DEFLATE, which makes the file smaller (read more on wikipedia). It is best to use gzip when you are trying to save some space but still want easy access to your files for peeking in them (see the speed results of grepping through log files at the bottom). The scenario where I use gzip the most is compressing 100-200 log files, each 20-30 megs (again, see the stats at the bottom for space gains). An important feature of gzip is that it applies to a single file (this is not a ZIP file, people!), so you will end up with 100-200 little gzip files instead of one big one. This can be a little annoying for transferring to clients, and for handling in general (notice the for loop in my bash script).

# gzip multiple files (assumes no whitespace in the file names)
for i in $(find . -name "*.log"); do gzip "$i"; done
 
# grep through all those gzip files
for i in $(find . -name "*.log.gz"); do zgrep "TIME" "$i"; done
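The loops above split the output of find on whitespace, so they break on file names that contain spaces. Here is a minimal sketch of a variant that lets find run the commands itself. The throwaway directory, the file names and the log lines are all made up for the illustration.

```shell
# Work in a throwaway directory so no real logs are touched.
workdir=$(mktemp -d)

# Two made-up log files; one has a space in its name on purpose.
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/app.log"
printf 'TIME 10:05 stop\nTIME 10:06 restart\n' > "$workdir/web server.log"

# Compress every .log file in place (each one becomes its own .log.gz).
find "$workdir" -name "*.log" -exec gzip {} +

# Count the compressed files, then the matching lines across all of them.
gz_count=$(find "$workdir" -name "*.log.gz" | wc -l | tr -d ' ')
match_count=$(find "$workdir" -name "*.log.gz" -exec zgrep "TIME" {} \; | wc -l | tr -d ' ')

echo "compressed files: $gz_count"
echo "lines containing TIME: $match_count"

# Clean up the throwaway directory.
rm -rf "$workdir"
```

With -exec, find passes each file name to the command as a single argument, so nothing gets split along the way.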

A little on tarball files

The ideal situation in which to use a tarball is when you want to bundle directories and data files into one nice and tidy compressed package for users/clients to download, while still preserving some of the file system information such as permissions and directory structure (read more on wikipedia). Note here that the end product is one file, which can be extremely useful in some cases (even a necessity at times). I say a necessity because sometimes, when handing in log files over FTP, you may want to have one package that you’ve encrypted for your client (using pgp), in which case you wouldn’t want to have 20-30 small packages all encrypted separately… In any case, to produce a tarball with a lot of logs, you can do the following under bash:

# Create a tarball
tar -czvf logs.tar.gz *.log
 
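# grep through a tarball (note: this extracts the logs into the current
# directory, then greps the extracted files by name)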
zcat logs.tar.gz | tar -xvf - | xargs grep "TIME"
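The pipeline above unpacks the logs to disk before grepping them. If you only want to peek inside, one alternative (not the command I timed in this post) is to ask tar to write the file contents to standard output with -O. The sketch below uses a throwaway directory and made-up log files.

```shell
# Build a small tarball from two made-up logs in a throwaway directory.
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/a.log"
printf 'TIME 10:05 stop\n' > "$workdir/b.log"

# -C changes into the directory first, so the temp path is not stored.
tar -czf "$workdir/logs.tar.gz" -C "$workdir" a.log b.log

# -O (--to-stdout) streams the file contents instead of writing them to disk.
tarball_matches=$(tar -xzOf "$workdir/logs.tar.gz" | grep -c "TIME")
echo "lines containing TIME: $tarball_matches"

# -t lists the members without extracting anything.
member_count=$(tar -tzf "$workdir/logs.tar.gz" | wc -l | tr -d ' ')
echo "files in the tarball: $member_count"

rm -rf "$workdir"
```

The catch is that grep sees one continuous stream, so you lose track of which file each matching line came from.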

Storage vs Speed for grepping through different types of files

Some statistics on storage vs speed, while grepping in different types of files.
Type of file    Size of files (Megabytes)    Time for grep (Seconds)
RAW             413 248                      2
GZIP            27 656                       3
TARBALL         27 520                       19

Notice the enormous gain in space going from raw log files to gzip log files: a 93% reduction in size (RAW to GZIP), compared to a mere 0.5% reduction in size by going from many GZIP files to one tarball. So far I haven’t even talked about the loss in speed that comes with compression. That is, of course, the most important thing to consider when dealing with files that you will need to peek through from time to time (logs are a perfect example).
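For those who want to check the arithmetic, here is a small sketch that recomputes both reductions from the sizes in the table. It assumes that each size is a single number written with a space as the thousands separator (so "413 248" is read as 413248); the percentages do not depend on the unit as long as all three rows use the same one. The reduction function is just a helper made up for this example.

```shell
# Sizes copied from the table, read as single numbers (assumption).
raw=413248
gzip_size=27656
tarball=27520

# Percentage saved when going from size $1 down to size $2.
reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f", (before - after) / before * 100 }'
}

raw_to_gzip=$(reduction "$raw" "$gzip_size")
gzip_to_tar=$(reduction "$gzip_size" "$tarball")

echo "RAW to GZIP:     ${raw_to_gzip}% smaller"
echo "GZIP to TARBALL: ${gzip_to_tar}% smaller"
```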

In order to compare the speed for each scenario, I used a very simple bash script, which I copied here for documentation (notice that I redirect the output to a ‘toto’ file so that I don’t get anything printed on my screen). The performance of my ‘grep’ command on the RAW logs was very good: 2 seconds to find 107 336 occurrences of “TIME” in the 10 logs. Comparing this with the results for the GZIP logs and the TARBALL, 3 and 19 seconds respectively, you can quickly see that it is extremely advantageous to use gzip for log files (look at the little graph of the different times if you are a more visual person…). Not only is the gain in storage negligible when going from GZIP to TARBALL, but access to your data is also a lot slower.

# Each test appends its matches to 'toto' and prints the time before and after.
# The loops assume no whitespace in the file names.
echo "Testing speed of RAW"
echo "===================="
date
for i in $(find . -name "*.log"); do grep "TIME" "$i" >> toto; done
date
 
echo "Testing speed of GZIP"
echo "===================="
date
for i in $(find . -name "*.log.gz"); do zgrep "TIME" "$i" >> toto; done
date
 
echo "Testing speed of TARBALL"
echo "======================="
date
# Extracts the logs into the current directory, then greps the extracted files.
zcat logs.tar.gz | tar -xvf - | xargs grep "TIME" >> toto
date
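Reading two timestamps and subtracting them by hand gets old quickly. As a rough sketch, a small helper can do the subtraction for you; time_it is a name I made up for this example, and the sample log is generated on the spot, so the numbers it prints say nothing about my real logs.

```shell
# Run a command, discard its output, and report the elapsed whole seconds.
time_it() {
    label=$1
    shift
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo "$label: $((end - start)) s"
}

# Tiny made-up log, in raw and gzip form, inside a throwaway directory.
workdir=$(mktemp -d)
printf 'TIME 10:00 start\nINFO nothing\n' > "$workdir/sample.log"
gzip -c "$workdir/sample.log" > "$workdir/sample.log.gz"

raw_report=$(time_it "RAW" grep "TIME" "$workdir/sample.log")
gzip_report=$(time_it "GZIP" zgrep "TIME" "$workdir/sample.log.gz")
echo "$raw_report"
echo "$gzip_report"

rm -rf "$workdir"
```

For finer resolution than whole seconds, bash's built-in time keyword in front of a command is another option.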

Concluding remark

There are situations where a tarball is necessary (or advantageous), but, in general, to keep the size of many log files down and still be able to search through them, I recommend using gzip. Not to mention that many of your favorite commands come in a gzip flavour (zcat, zgrep, zdiff, zmore, etc.) and vi can easily read a gzip file on the fly! What more can you ask for?

Climber’s Rock

One week away from the second event of the Eastern leg of the Tour de Bloc 2008/2009 season, we decided to go and pay a visit to the brand new climbing facility in Burlington, Ontario. I was told that the angles were phenomenal, and that the space was just incredible. I had to see it for myself. And before the competition day if possible.

Well, after about 14 weeks of hard work, the Climber’s Rock isn’t just open for business, it’s ready for one of the most promising competitions the tour has seen in a long while! I was personally very impressed with the size of the bouldering wall, the angles, the quantity of holds and problems, and most of all, the space! The potential of this place is just incredible.

If you have the chance to go check it out before the competition, I strongly suggest you do so. Only two things to report about this gym. The holds are brand new, so it is pretty much like climbing outside in the sense that you will lose a lot of skin… Also, there were a lot of spinner holds. Add to that the fact that there are not many mats for protection, and you should *ALWAYS* spot your partner until he/she has topped out completely. Yes, you read this correctly. Top outs. Lots of them. Go see for yourself: http://picasaweb.google.com/climbersrock/. Alright, I’ve said enough, see you next week.

Oh, one last thing: for those of you coming via GO Transit, the station is directly across the road. It’s Appleby station, not Burlington. Again, Appleby station.

Auto margin shifts my page content to the left

I found myself wasting about 2 hours of my time trying to understand why only one of my pages moved about 16px to the left when switching back and forth between the different pages. What puzzled me was that I was using a template, so everything on the two pages was exactly the same. Here is an example of what I mean.

Using the famous divide-and-conquer problem-solving method (ha ha!), I started removing text to find the culprit paragraph, and realized that the problem was only present when I had a lot of content on the page. After a few Google searches, I found threads on SitePoint that describe exactly my problem (titled: Web Page Wiggle Issue and Margin Auto Wiggles & The Vertical Scrollbar). The problem lies in the use of “auto” for centering pages, as in the very popular “body {margin: 0 auto; width: 960px;}”. The centered content shifts to the left on pages where the vertical scrollbar is needed (longer content) compared to pages where it doesn’t appear, because the scrollbar takes up part of the window width (Firefox adds the scrollbar as needed, whereas IE always keeps it!). From the website, I found two fixes for the problem:

You can easily fix this, using CSS, by adding this one line in your stylesheet:

html {overflow-y: scroll;}

Another fix, if you are using jQuery, is the following:

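// Force a vertical scrollbar on every page by appending an empty,
// 1px-wide element that is one pixel taller than the window.
$(function () {
    $('<div/>').css({
        position: 'absolute',
        top: 0,
        width: '1px',
        height: ($(window).height() + 1) + 'px'
    }).appendTo('body');
});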

Barack Obama, 44th president of the United States

It is now official. The speech is coming up in about 5-10 minutes. The prediction is that Obama will have 353 electoral votes against 183 for McCain. I do have to say that McCain’s speech was VERY well put together and very well delivered. I was impressed by the speech. However, I can’t wait to hear Obama’s speech. This *IS* history happening in front of our eyes! WOW!


© Copyright Bonuel Photography - Theme by Pexeto