Text 7912, 815 lines
Written 2006-11-11 09:34:04 by Robert Wolfe (1:261/1)
Subject: Archiving and Compression
=================================
Archiving and Compression
By Scott Granneman
Created 2006-10-23 01:00
Chapter 8 from Scott Granneman's new book "Linux Phrasebook: The Pocket Guide
Every Linux User Needs". Linux Phrasebook offers a concise reference that,
like a language phrasebook, can be used "in the street." The book goes
straight to practical Linux uses, providing immediate solutions for day-to-day
tasks.
Chapter 8: Archiving and Compression
Although the distinction is often blurred in casual conversation, archiving
files and compressing them are two entirely different operations. Archiving
means that you take 10 files and combine them into one file,
with no difference in size. If you start with 10 100KB files and archive them,
the resulting single file is 1000KB. On the other hand, if you compress those
10 files, you might find that the resulting files range from only a few
kilobytes to close to the original size of 100KB, depending upon the original
file type.
Note - In fact, you might end up with a bigger file during compression! If the
file is already compressed, compressing it again adds extra overhead, resulting
in a slightly bigger file.
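To see the distinction in numbers, here is a minimal sketch, assuming tar, gzip, od, and /dev/urandom are available. The file names and sizes are made up for illustration. It archives three files, compresses the archive once, and then compresses the result a second time.

```shell
#!/bin/sh
# Sketch: archiving keeps the size, compressing shrinks it, and
# compressing an already compressed file adds overhead.
# All file names here are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build three compressible sample files (hex dumps of random bytes).
for n in 1 2 3; do
    head -c 20000 /dev/urandom | od -An -tx1 > "sample$n.txt"
done

# Print a file's size in bytes.
size_of() { wc -c < "$1" | tr -d ' '; }

sum=$(( $(size_of sample1.txt) + $(size_of sample2.txt) + $(size_of sample3.txt) ))

tar -cf samples.tar sample1.txt sample2.txt sample3.txt   # archive only
gzip -c samples.tar > samples.tar.gz                      # compress once
gzip -c samples.tar.gz > samples.tar.gz.gz                # compress again

archived=$(size_of samples.tar)
once=$(size_of samples.tar.gz)
twice=$(size_of samples.tar.gz.gz)

echo "files total:   $sum bytes"
echo "tar archive:   $archived bytes"
echo "gzipped once:  $once bytes"
echo "gzipped twice: $twice bytes"

cd - > /dev/null
# Remove "$workdir" when you are done looking at the files.
```

The tarball should come out a little larger than the files put together, the first gzip pass much smaller, and the second pass slightly larger than the first.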
All of the archive and compression formats in this chapter - zip, gzip, bzip2,
and tar - are popular, but zip is probably the world's most widely used format.
That's because of its almost universal use on Windows, but zip and unzip are
well supported among all major (and most minor) operating systems, so things
compressed using zip also work on Linux and Mac OS. If you're sending archives
out to users and you don't know which operating systems they're using, zip is a
safe choice to make.
gzip was designed as an open-source replacement for an older Unix program,
compress. It's found on virtually every Unix-based system in the world,
including Linux and Mac OS X, but it is much less common on Windows. If you're
sending files back and forth to users of Unix-based machines, gzip is a safe
choice.
The bzip2 command is the new kid on the block. Designed to supersede gzip,
bzip2 creates smaller files, but at the cost of speed. That said, computers are
so fast nowadays that most users won't notice much of a difference between the
times it takes gzip or bzip2 to compress a group of files.
Note - Linux Magazine published a good article comparing several different
compression formats, which you can find at
www.linux-mag.com/content/view/1678/43/.
zip, gzip, and bzip2 are focused on compression (although zip also archives).
The tar command does one thing - archive - and it has been doing it for a long
time. It's found almost solely on Unix-based machines. You'll definitely run
into tar files (also called tarballs) if you download source code, but almost
every Linux user can expect to encounter a tarball some time in his career.
Archive and Compress Files Using zip
zip
zip both archives and compresses files, thus making it great for sending
multiple files as email attachments, backing up items, or for saving disk
space. Using it is simple. Let's say you want to send a TIFF to someone via
email. A TIFF image is uncompressed, so it tends to be pretty large. Zipping it
up should help make the email attachment a bit smaller.
Note - When using ls -l, I'm only showing the information needed for each
example.
$ ls -lh
-rw-r--r-- scott scott 1006K young_edgar_scott.tif
$ zip grandpa.zip young_edgar_scott.tif
adding: young_edgar_scott.tif (deflated 19%)
$ ls -lh
-rw-r--r-- scott scott 1006K young_edgar_scott.tif
-rw-r--r-- scott scott 819K grandpa.zip
In this case, you shaved off about 200KB on the resulting zip file, or 19%, as
zip helpfully informs you. Not bad. You can do the same thing for several
images.
$ ls -l
-rw-r--r-- scott scott 251980 edgar_intl_shoe.tif
-rw-r--r-- scott scott 1130922 edgar_baby.tif
-rw-r--r-- scott scott 1029224 young_edgar_scott.tif
$ zip grandpa.zip edgar_intl_shoe.tif edgar_baby.tif young_edgar_scott.tif
adding: edgar_intl_shoe.tif (deflated 4%)
adding: edgar_baby.tif (deflated 12%)
adding: young_edgar_scott.tif (deflated 19%)
$ ls -l
-rw-r--r-- scott scott 251980 edgar_intl_shoe.tif
-rw-r--r-- scott scott 1130922 edgar_baby.tif
-rw-r--r-- scott scott 2074296 grandpa.zip
-rw-r--r-- scott scott 1029224 young_edgar_scott.tif
It's not too polite, however, to zip up individual files this way. For three
files, it's not so bad. The recipient will unzip grandpa.zip and end up with
three individual files. If the payload was 50 files, however, the user would
end up with files strewn everywhere. Better to zip up a directory containing
those 50 files so when the user unzips it, he's left with a tidy directory
instead.
$ ls -lF
drwxr-xr-x scott scott edgar_scott/
$ zip -r grandpa.zip edgar_scott
adding: edgar_scott/ (stored 0%)
adding: edgar_scott/edgar_baby.tif (deflated 12%)
adding: edgar_scott/young_edgar_scott.tif (deflated 19%)
adding: edgar_scott/edgar_intl_shoe.tif (deflated 4%)
$ ls -lF
drwxr-xr-x scott scott 160 edgar_scott/
-rw-r--r-- scott scott 2074502 grandpa.zip
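Whether you're zipping up a file, several files, or a directory, the pattern is
the same: the zip command, followed by the name of the Zip file you're
creating, and finished with the item(s) you're adding to the Zip file. For a
directory, add the -r (recursive) option so that zip includes the directory's
contents and not just the empty directory entry.
Get the Best Compression Possible with zip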
-[0-9]
It's possible to adjust the level of compression that zip uses when it does its
job. The zip command uses a scale from 0 to 9, in which 0 means "no compression
at all" (which is like tar, as you'll see later), 1 means "do the job quickly,
but don't bother compressing very much," and 9 means "compress the heck out of
the files, and I don't mind waiting a bit longer to get the job done." The
default is 6, but modern computers are fast enough that it's probably just fine
to use 9 all the time.
Say you're interested in researching Herman Melville's Moby-Dick, so you want
to collect key texts to help you understand the book: Moby-Dick itself,
Milton's Paradise Lost, and the Bible's book of Job. Let's compare the results
of different compression rates.
$ ls -l
-rw-r--r-- scott scott 102519 job.txt
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 508925 paradise_lost.txt
$ zip -0 moby.zip *.txt
adding: job.txt (stored 0%)
adding: moby-dick.txt (stored 0%)
adding: paradise_lost.txt (stored 0%)
$ ls -l
-rw-r--r-- scott scott 102519 job.txt
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 1848444 moby.zip
-rw-r--r-- scott scott 508925 paradise_lost.txt
$ zip -1 moby.zip *.txt
updating: job.txt (deflated 58%)
updating: moby-dick.txt (deflated 54%)
updating: paradise_lost.txt (deflated 50%)
$ ls -l
-rw-r--r-- scott scott 102519 job.txt
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 869946 moby.zip
-rw-r--r-- scott scott 508925 paradise_lost.txt
$ zip -9 moby.zip *.txt
updating: job.txt (deflated 65%)
updating: moby-dick.txt (deflated 61%)
updating: paradise_lost.txt (deflated 56%)
$ ls -l
-rw-r--r-- scott scott 102519 job.txt
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 747730 moby.zip
-rw-r--r-- scott scott 508925 paradise_lost.txt
In tabular format, the results look like this:
Book               zip -0     zip -1     zip -9
Moby-Dick          0%         54%        61%
Paradise Lost      0%         50%        56%
Job                0%         58%        65%
Total (in bytes)   1848444    869946     747730
The results you see here would vary depending on the file types (text files
typically compress well) and the sizes of the original files, but this gives
you a good idea of what you can expect. Unless you have a really slow machine
or you're just naturally impatient, you should just use -9 all the time to get
the maximum compression.
Note - If you want to be clever, define an alias in your .bashrc file that
looks like this:
alias zip='zip -9'
That way you'll always use -9 and won't have to think about it.
Password-Protect Compressed Zip Archives
-P
-e
The Zip program allows you to password-protect your Zip archives using the -P
option. You shouldn't use this option. It's completely insecure, as you can see
in the following example (the actual password is 12345678):
$ zip -P 12345678 moby.zip *.txt
Because you had to specify the password on the command line, anyone viewing
your shell's history (and you might be surprised how easy it is for other users
to do so) can see your password in all its glory. Don't use the -P option!
Instead, just use the -e option, which encrypts the contents of your Zip file
and also uses a password. The difference, however, is that you're prompted to
type the password in, so it won't be saved in the history of your shell events.
$ zip -e moby.zip *.txt
Enter password:
Verify password:
adding: job.txt (deflated 65%)
adding: moby-dick.txt (deflated 61%)
adding: paradise_lost.txt (deflated 56%)
The only part of this that's saved in the shell history is zip -e moby.zip *.txt. The
actual password you type disappears into the ether, unavailable to anyone
viewing your shell history.
Caution - The security offered by the Zip program's password protection isn't
that great. In fact, it's pretty easy to find a multitude of tools floating
around the Internet that can quickly crack a password-protected Zip archive.
Think of password-protecting a Zip file as the difference between writing a
message on a postcard and sealing it in an envelope: It's good enough for
ordinary folks, but it won't stop a determined attacker.
Also, the version of zip included with some Linux distros may not support
encryption, in which case you'll see a zip error: "encryption not supported."
The only solution: recompile zip from source. Ugh.
Unzip Files
unzip
Expanding a Zip archive isn't hard at all. To create a zipped archive, use the
zip command; to expand that archive, use the unzip command.
$ unzip moby.zip
Archive: moby.zip
inflating: job.txt
inflating: moby-dick.txt
inflating: paradise_lost.txt
The unzip command helpfully tells you what it's doing as it works. To get even
more information, add the -v option (which stands, of course, for verbose).
$ unzip -v moby.zip
Archive: moby.zip
Length Method Size Ratio CRC-32 Name
------- ------ ------ ----- ------ ----
102519 Defl:X 35747 65% fabf86c9 job.txt
1236574 Defl:X 487553 61% 34a8cc3a moby-dick.txt
508925 Defl:X 224004 56% 6abe1d0f paradise_lost.t
------- ------ --- -------
1848018 747304 60% 3 files
There's quite a bit of useful data here, including the method used to compress
the files, the compression ratio achieved for each file, and the cyclic
redundancy check (CRC) used for error detection.
List Files That Will Be Unzipped
-l
Sometimes you might find yourself looking at a Zip file and not remembering
what's in that file. Or perhaps you want to make sure that a file you need is
contained within that Zip file. To list the contents of a zip file without
unzipping it, use the -l option (which stands for "list").
$ unzip -l moby.zip
Archive: moby.zip
Length Date Time Name
-------- ---- ---- ----
0 01-26-06 18:40 bible/
207254 01-26-06 18:40 bible/genesis.txt
102519 01-26-06 18:19 bible/job.txt
1236574 01-26-06 18:19 moby-dick.txt
508925 01-26-06 18:19 paradise_lost.txt
-------- -------
2055272 5 files
From these results, you can see that moby.zip contains two files -
moby-dick.txt and paradise_lost.txt - and a directory (bible), which itself
contains two files, genesis.txt and job.txt. Now you know exactly what will
happen when you expand moby.zip. Using the -l option helps prevent
inadvertently unzipping a file that spews out 100 files instead of unzipping a
directory that contains 100 files. The first leaves you with files strewn
pell-mell, while the second is far easier to handle.
Test Files That Will Be Unzipped
-t
Sometimes zipped archives become corrupted. The worst time to discover this is
after you've unzipped the archive and deleted it, only to discover that some or
even all of the unzipped contents are damaged and won't open. Better to test
the archive first before you actually unzip it by using the -t (for test)
option.
$ unzip -t moby.zip
Archive: moby.zip
testing: bible/ OK
testing: bible/genesis.txt OK
testing: bible/job.txt OK
testing: moby-dick.txt OK
testing: paradise_lost.txt OK
No errors detected in compressed data of moby.zip.
You really should use -t every time you work with a zipped file. It's the smart
thing to do, and although it might take some extra time, it's worth it in the
end.
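The habit of testing first is easy to script. What follows is only a sketch, assuming the Info-ZIP zip and unzip commands are installed; the texts directory, the texts.zip archive, and the out directory are names invented for this example. It relies on unzip returning a nonzero exit status when the test fails.

```shell
#!/bin/sh
# Sketch: test a Zip archive and expand it only if the test passes.
# Assumes Info-ZIP zip and unzip; all names below are illustrative.
if ! command -v zip > /dev/null 2>&1 || ! command -v unzip > /dev/null 2>&1; then
    status="skipped"
    echo "zip or unzip is not installed; skipping"
else
    workdir=$(mktemp -d)
    mkdir "$workdir/texts"
    echo "Call me Ishmael." > "$workdir/texts/moby-dick.txt"

    # Create a small archive to work with (-r includes the directory contents).
    (cd "$workdir" && zip -q -r texts.zip texts)

    # -t tests every entry; -q keeps the report to a single line.
    if unzip -tq "$workdir/texts.zip"; then
        # -d names the directory to unzip into.
        unzip -q "$workdir/texts.zip" -d "$workdir/out"
        status="ok"
    else
        echo "texts.zip is damaged; not unzipping" >&2
        status="damaged"
    fi
fi
echo "result: $status"
```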
Archive and Compress Files Using gzip
gzip
Using gzip is a bit easier than zip in some ways. With zip, you need to specify
the name of the newly created Zip file or zip won't work; with gzip, though,
you can just type the command and the name of the file you want to compress.
$ ls -l
-rw-r--r-- scott scott 508925 paradise_lost.txt
$ gzip paradise_lost.txt
$ ls -l
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
You should be aware of a very big difference between zip and gzip: When you zip
a file, zip leaves the original behind so you have both the original and the
newly zipped file, but when you gzip a file, you're left with only the new
gzipped file. The original is gone.
If you want gzip to leave behind the original file, you need to use the -c (or
--stdout or --to-stdout) option, which writes the results of gzip to standard
output, so you need to redirect that output to another file. If you use -c and
forget to redirect your output, your terminal fills up with binary nonsense.
Not good. Instead, output to a file.
$ ls -l
-rw-r--r-- 1 scott scott 508925 paradise_lost.txt
$ gzip -c paradise_lost.txt > paradise_lost.txt.gz
$ ls -lh
-rw-r--r-- 1 scott scott 497K paradise_lost.txt
-rw-r--r-- 1 scott scott 220K paradise_lost.txt.gz
Much better! Now you have both your original file and the gzipped version.
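If you want to convince yourself that nothing was lost along the way, a short script can do the round trip for you. This is a minimal sketch; the temporary directory and the one-line sample file are stand-ins for your real data.

```shell
#!/bin/sh
# Sketch: gzip a file while keeping the original, then verify the copy.
# The file name and its contents are illustrative.
workdir=$(mktemp -d)
original="$workdir/paradise_lost.txt"
echo "Of Man's first disobedience, and the fruit" > "$original"

# -c writes the compressed data to standard output; the redirect
# captures it in a new file, so the original is never touched.
gzip -c "$original" > "$original.gz"

# Decompress to standard output and compare with the original.
if gzip -dc "$original.gz" | cmp -s - "$original"; then
    roundtrip="identical"
else
    roundtrip="different"
fi

echo "round trip: $roundtrip"
ls "$workdir"
```

Newer releases of GNU gzip also offer a -k (or --keep) option that keeps the original without any redirection. Check gzip --help to see whether your version has it before relying on it.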
Tip: If you accidentally use the -c option without specifying an output file,
just start pressing Ctrl+C several times until gzip stops.
Archive and Compress Files Recursively Using gzip
-r
If you want to use gzip on several files in a directory, just use a wildcard.
You might not end up gzipping everything you think you will, however, as this
example shows.
$ ls -F
bible/ moby-dick.txt paradise_lost.txt
$ ls -l *
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 508925 paradise_lost.txt
bible:
-rw-r--r-- scott scott 207254 genesis.txt
-rw-r--r-- scott scott 102519 job.txt
$ gzip *
gzip: bible is a directory -- ignored
$ ls -l *
-rw-r--r-- scott scott 489609 moby-dick.txt.gz
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
bible:
-rw-r--r-- scott scott 207254 genesis.txt
-rw-r--r-- scott scott 102519 job.txt
Notice that the wildcard didn't do anything for the files inside the bible
directory because gzip by default doesn't walk down into subdirectories. To get
that behavior, you need to use the -r (or --recursive) option along with your
wildcard.
$ ls -F
bible/ moby-dick.txt paradise_lost.txt
$ ls -l *
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 508925 paradise_lost.txt
bible:
-rw-r--r-- scott scott 207254 genesis.txt
-rw-r--r-- scott scott 102519 job.txt
$ gzip -r *
$ ls -l *
-rw-r--r-- scott scott 489609 moby-dick.txt.gz
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
bible:
-rw-r--r-- scott scott 62114 genesis.txt.gz
-rw-r--r-- scott scott 35984 job.txt.gz
This time, every file - even those in subdirectories - was gzipped. However,
note that each file is individually gzipped. The gzip command cannot combine
all the files into one big file, like you can with the zip command. To do that,
you need to incorporate tar, as you'll see in "Archive and Compress Files with
tar and gzip."
Get the Best Compression Possible with gzip
-[1-9]
Just as with zip, it's possible to adjust the level of compression that gzip
uses when it does its job. The gzip command's documented scale runs from 1 to 9
(unlike zip, it has no 0 setting for "no compression at all"), in which 1
means "do the job quickly, but don't bother compressing very much," and 9 means
"compress the heck out of the files, and I don't mind waiting a bit longer to
get the job done." The default is 6, but modern computers are fast enough that
it's probably just fine to use 9 all the time.
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
$ gzip -c -1 moby-dick.txt > moby-dick.txt.gz
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 571005 moby-dick.txt.gz
$ gzip -c -9 moby-dick.txt > moby-dick.txt.gz
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 487585 moby-dick.txt.gz
Remember to use the -c option and redirect the output into the actual .gz file due
to the way gzip works, as discussed in "Archive and Compress Files Using gzip."
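To run the same comparison on your own files, a small loop saves typing. This is a sketch only: the sample text is generated on the spot, and the sample.N.gz names are invented for the example. It also checks that each compressed copy expands back to the original.

```shell
#!/bin/sh
# Sketch: compare gzip compression levels on one sample file.
# File names and sample contents are illustrative.
workdir=$(mktemp -d)
sample="$workdir/sample.txt"

# Generate some repetitive sample text to compress.
i=0
while [ "$i" -lt 2000 ]; do
    echo "line $i: Call me Ishmael. Some years ago, never mind how long precisely."
    i=$((i + 1))
done > "$sample"

original_size=$(wc -c < "$sample" | tr -d ' ')
echo "original: $original_size bytes"

failures=0
for level in 1 6 9; do
    out="$workdir/sample.$level.gz"
    gzip -c "-$level" "$sample" > "$out"
    size=$(wc -c < "$out" | tr -d ' ')
    echo "gzip -$level: $size bytes"
    # Every level must decompress back to exactly the same bytes.
    gzip -dc "$out" | cmp -s - "$sample" || failures=$((failures + 1))
done
echo "round-trip failures: $failures"
```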
Note - If you want to be clever, define an alias in your .bashrc file that
looks like this:
alias gzip='gzip -9'
That way, you'll always use -9 and won't have to think about it.
Uncompress Files Compressed with gzip
gunzip
Getting files out of a gzipped archive is easy with the gunzip command.
$ ls -l
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
$ gunzip paradise_lost.txt.gz
$ ls -l
-rw-r--r-- scott scott 508925 paradise_lost.txt
In the same way that gzip removes the original file, leaving you solely with
the gzipped result, gunzip removes the .gz file, leaving you with the final
gunzipped result. If you want to ensure that you have both, you need to use the
-c option (or --stdout or --to-stdout) and redirect the results to the file you
want to create.
$ ls -l
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
$ gunzip -c paradise_lost.txt.gz > paradise_lost.txt
$ ls -l
-rw-r--r-- scott scott 508925 paradise_lost.txt
-rw-r--r-- scott scott 224425 paradise_lost.txt.gz
It's probably a good idea to use -c, especially if you plan to keep the .gz
file around or pass it along to someone else. Sure, you could run gzip again to
re-create the archive, but why go to the extra work?
Note - If you don't like the gunzip command, you can also use gzip -d (or
--decompress or --uncompress).
Test Files That Will Be Unzipped with gunzip
-t
Before gunzipping a file (or files) with gunzip, you might want to verify that
they're going to gunzip correctly without any file corruption. To do this, use
the -t (or --test) option.
$ gzip -t paradise_lost.txt.gz
$
That's right: If nothing is wrong with the archive, gzip reports nothing back
to you. If there's a problem, you'll know, but if there's not a problem, gzip
is silent. That can be a bit disconcerting, but that's how Unix-based systems
work. They're generally only noisy if there's an issue you should know about,
not if everything is working as it should.
Archive and Compress Files Using bzip2
bzip2
Working with bzip2 is pretty easy if you're comfortable with gzip, as the
creators of bzip2 deliberately made the options and behavior of the new command
as similar to its progenitor as possible.
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
$ bzip2 moby-dick.txt
$ ls -l
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
Just like gzip, bzip2 leaves you with just the .bz2 file. The original
moby-dick.txt is gone. To keep the original file, use the -c (or --stdout)
option and redirect the output to a filename that ends with .bz2.
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
$ bzip2 -c moby-dick.txt > moby-dick.txt.bz2
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
If you look back at "Archive and Compress Files Using gzip," you'll see that
gzip and bzip2 are incredibly similar, which is by design.
Get the Best Compression Possible with bzip2
-[1-9]
Just as with zip and gzip, it's possible to adjust the level of compression
that bzip2 uses when it does its job. The bzip2 command uses a scale from 1 to
9, where the number sets the size of the blocks bzip2 works on, from 100KB at 1
to 900KB at 9. In practice, 1 means "use less memory, but don't bother
compressing very much," and 9 means "compress the heck out of the files."
Unlike zip and gzip, bzip2 already defaults to 9, which is why plain bzip2 and
bzip2 -9 both turn moby-dick.txt into the same 367248-byte file in this
chapter.
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
$ bzip2 -c -1 moby-dick.txt > moby-dick.txt.bz2
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 424084 moby-dick.txt.bz2
$ bzip2 -c -9 moby-dick.txt > moby-dick.txt.bz2
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
From 424KB with 1 to 367KB with 9 - that's quite a difference! Also notice the
difference in ultimate file size between gzip and bzip2. At -9, gzip compressed
moby-dick.txt down to 488KB, while bzip2 mashed it even further to 367KB. The
bzip2 command is noticeably slower than the gzip command, but on a fast machine
that means that bzip2 takes two or three seconds longer than gzip, which
frankly isn't much to worry about.
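In the listings in this chapter, plain bzip2 and bzip2 -9 both produce a 367248-byte file from moby-dick.txt. The following sketch lets you check on your own machine whether the two really give identical output, and how gzip -9 compares. It assumes bzip2 is installed and skips itself if not; the sample text and file names are invented for the example.

```shell
#!/bin/sh
# Sketch: compare gzip -9 with bzip2, and plain bzip2 with bzip2 -9.
# File names and sample contents are illustrative.
if ! command -v bzip2 > /dev/null 2>&1; then
    status="skipped"
    echo "bzip2 is not installed; skipping"
else
    workdir=$(mktemp -d)
    sample="$workdir/sample.txt"

    # Generate some repetitive sample text to compress.
    i=0
    while [ "$i" -lt 2000 ]; do
        echo "line $i: Call me Ishmael. Some years ago, never mind how long precisely."
        i=$((i + 1))
    done > "$sample"

    gzip  -c -9 "$sample" > "$workdir/sample.gz"
    bzip2 -c    "$sample" > "$workdir/default.bz2"
    bzip2 -c -9 "$sample" > "$workdir/best.bz2"

    for f in sample.txt sample.gz default.bz2 best.bz2; do
        echo "$f: $(wc -c < "$workdir/$f" | tr -d ' ') bytes"
    done

    # Byte-for-byte comparison of the two bzip2 results.
    if cmp -s "$workdir/default.bz2" "$workdir/best.bz2"; then
        same="yes"
    else
        same="no"
    fi
    echo "plain bzip2 and bzip2 -9 identical: $same"
    status="ok"
fi
```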
Note - If you want to be explicit about it, you can define an alias in your
.bashrc file that looks like this:
alias bzip2='bzip2 -9'
It spells out your choice, but because bzip2 already uses 9 by default, the
alias doesn't change the results.
Uncompress Files Compressed with bzip2
bunzip2
In the same way that bzip2 was purposely designed to emulate gzip as closely as
possible, the way bunzip2 works is very close to that of gunzip.
$ ls -l
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
$ bunzip2 moby-dick.txt.bz2
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
You'll notice that bunzip2 is similar to gunzip in another way: Both commands
remove the original compressed file, leaving you with the final uncompressed
result. If you want to ensure that you have both the compressed and
uncompressed files, you need to use the -c option (or --stdout)
and redirect the results to the file you want to create.
$ ls -l
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
$ bunzip2 -c moby-dick.txt.bz2 > moby-dick.txt
$ ls -l
-rw-r--r-- scott scott 1236574 moby-dick.txt
-rw-r--r-- scott scott 367248 moby-dick.txt.bz2
It's a good thing when commands copy each other's options and behavior, as it
makes them easier to learn. In this, the creators of bzip2 and bunzip2 showed
remarkable foresight.
Note - If you're not feeling favorable toward bunzip2, you can also use bzip2
-d (or --decompress).
Test Files That Will Be Unzipped with bunzip2
-t
Before bunzipping a file (or files) with bunzip2, you might want to verify that
they're going to bunzip correctly without any file corruption. To do this, use
the -t (or --test) option.
$ bunzip2 -t moby-dick.txt.bz2
$
Just as with gunzip, if there's nothing wrong with the archive, bunzip2 doesn't
report anything back to you. If there's a problem, you'll know, but if there's
not a problem, bunzip2 is silent.
Archive Files with tar
-cf
Remember, tar doesn't compress; it merely archives (the resulting archives are
known as tarballs, by the way). Instead, tar uses other programs, such as gzip
or bzip2, to compress the archives that tar creates. Even if you're not going
to compress the tarball, you still create it the same way with the same basic
options: -c (or --create), which tells tar that you're making a tarball, and -f
(or --file), which specifies the filename for the tarball.
$ ls -l
scott scott 102519 job.txt
scott scott 1236574 moby-dick.txt
scott scott 508925 paradise_lost.txt
$ tar -cf moby.tar *.txt
$ ls -l
scott scott 102519 job.txt
scott scott 1236574 moby-dick.txt
scott scott 1853440 moby.tar
scott scott 508925 paradise_lost.txt
Pay attention to two things here. First, add up the file sizes of job.txt,
moby-dick.txt, and paradise_lost.txt, and you get 1848018 bytes. Compare that
to the size of moby.tar, and you see that the tarball is only 5422 bytes
bigger. Remember that tar is an archive tool, not a compression tool, so the
result is at least the same size as the individual files put together, plus a
little bit for overhead to keep track of what's in the tarball. Second, notice
that tar, unlike gzip and bzip2, leaves the original files behind. This isn't a
surprise, considering the tar command's background as a backup tool.
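The 5422 bytes of overhead aren't arbitrary. As a rough sketch, assume the classic tar layout: one 512-byte header per file, file data padded to a multiple of 512 bytes, two empty 512-byte blocks at the end, and the total rounded up to 10240-byte records (GNU tar's usual default). Under those assumptions you can predict the size of moby.tar with shell arithmetic. The tar_size function name is made up for this example.

```shell
#!/bin/sh
# Sketch: estimate the size of an uncompressed tarball from the sizes
# of its member files. Assumes 512-byte blocks, one header block per
# member, two zero blocks at the end, and 10240-byte records.
tar_size() {
    total=0
    for size in "$@"; do
        # Round each file up to whole 512-byte blocks, plus one header block.
        blocks=$(( (size + 511) / 512 ))
        total=$(( total + 512 + blocks * 512 ))
    done
    # End-of-archive marker: two empty 512-byte blocks.
    total=$(( total + 1024 ))
    # Round the whole archive up to a multiple of the record size.
    record=10240
    echo $(( (total + record - 1) / record * record ))
}

# job.txt, moby-dick.txt, and paradise_lost.txt from the listing above.
tar_size 102519 1236574 508925   # prints 1853440
```

The result matches the size of moby.tar in the listing. Tarballs with very long file names or extended attributes can carry extra header blocks, so treat the estimate as a lower bound in those cases.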
What's really cool about tar is that it's designed to archive entire directory
structures, so you can archive a large number of files and subdirectories in
one fell swoop.
$ ls -lF
drwxr-xr-x scott scott 168 moby-dick/
$ ls -l moby-dick/*
scott scott 102519 moby-dick/job.txt
scott scott 1236574 moby-dick/moby-dick.txt
scott scott 508925 moby-dick/paradise_lost.txt
moby-dick/bible:
scott scott 207254 genesis.txt
scott scott 102519 job.txt
$ tar -cf moby.tar moby-dick/
$ ls -lF
scott scott 168 moby-dick/
scott scott 2170880 moby.tar
The tar command has been around forever, and it's obvious why: It's so darn
useful! But it gets even more useful when you start factoring in compression
tools, as you'll see in the next section.
Archive and Compress Files with tar and gzip
-zcvf
If you look back at "Archive and Compress Files Using gzip" and "Archive and
Compress Files Using bzip2" and think about what was discussed there, you'll
probably start to figure out a problem. What if you want to compress a
directory that contains 100 files, contained in various subdirectories? If you
use gzip or bzip2 with the -r (for recursive) option, you'll end up with 100
individually compressed files, each stored neatly in its original subdirectory.
This is undoubtedly not what you want. How would you like to attach 100 .gz or
.bz2 files to an email? Yikes!
That's where tar comes in. First you'd use tar to archive the directory and its
contents (those 100 files inside various subdirectories) and then you'd use
gzip or bzip2 to compress the resulting tarball. Because gzip is the most
common compression program used in concert with tar, we'll focus on that.
You could do it this way:
$ ls -l moby-dick/*
scott scott 102519 moby-dick/job.txt
scott scott 1236574 moby-dick/moby-dick.txt
scott scott 508925 moby-dick/paradise_lost.txt
moby-dick/bible:
scott scott 207254 genesis.txt
scott scott 102519 job.txt
$ tar -cf - moby-dick/ | gzip -c > moby.tar.gz
$ ls -l
scott scott 168 moby-dick/
scott scott 846049 moby.tar.gz
That method works (the - after -f tells tar to write the tarball to standard
output so that gzip can read it), but it's just too much typing! There's a much easier way
that should be your default. It involves two new options for tar: -z (or
--gzip), which invokes gzip from within tar so you don't have to do so
manually, and -v (or --verbose), which isn't required here but is always
useful, as it keeps you notified as to what tar is doing as it runs.
$ ls -l moby-dick/*
scott scott 102519 moby-dick/job.txt
scott scott 1236574 moby-dick/moby-dick.txt
scott scott 508925 moby-dick/paradise_lost.txt
moby-dick/bible:
scott scott 207254 genesis.txt
scott scott 102519 job.txt
$ tar -zcvf moby.tar.gz moby-dick/
moby-dick/
moby-dick/job.txt
moby-dick/bible/
moby-dick/bible/genesis.txt
moby-dick/bible/job.txt
moby-dick/moby-dick.txt
moby-dick/paradise_lost.txt
$ ls -l
scott scott 168 moby-dick
scott scott 846049 moby.tar.gz
The usual extension for a file that has had the tar and then the gzip commands
used on it is .tar.gz; however, you could use .tgz and .tar.gzip if you like.
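The one-step and two-step approaches can be compared side by side. This is a minimal sketch with a made-up directory tree; one-step.tar.gz and two-step.tar.gz are names invented for the example. In the two-step version, -f - tells tar to write the tarball to standard output so gzip can read it from the pipe.

```shell
#!/bin/sh
# Sketch: create a gzipped tarball in one step and in two steps, then
# confirm that both contain the same members. Names are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir -p moby-dick/bible
echo "Call me Ishmael." > moby-dick/moby-dick.txt
echo "In the beginning" > moby-dick/bible/genesis.txt

# One step: tar runs gzip for you.
tar -zcf one-step.tar.gz moby-dick/

# Two steps: tar writes to standard output, gzip compresses the stream.
tar -cf - moby-dick/ | gzip -c > two-step.tar.gz

# List the members of each archive in a stable order.
one=$(tar -ztf one-step.tar.gz | sort)
two=$(gzip -dc two-step.tar.gz | tar -tf - | sort)

echo "$one"
if [ "$one" = "$two" ]; then
    echo "both archives list the same members"
fi

cd - > /dev/null
```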
Note - It's entirely possible to use bzip2 with tar instead of gzip. Your
command would look like this (note the -j option, which is where bzip2 comes
in):
$ tar -jcvf moby.tar.bz2 moby-dick/
In that case, the extension should be .tar.bz2, although you may also use
.tar.bzip2, .tbz2, or .tbz. Yes, it's very confusing that .tbz (for bzip2) and
.tgz (for gzip) differ by only a single letter. This is a strong argument for
using anything but that particular extension to keep confusion to a minimum.
Test Files That Will Be Untarred and Uncompressed
-zvtf
Before you take apart a tarball (whether or not it was also compressed using
gzip), it's a really good idea to test it. First, you'll know if the tarball is
corrupted, saving yourself hair pulling when files don't seem to work. Second,
you'll know if the person who created the tarball thoughtfully tarred up a
directory containing 100 files, or instead thoughtlessly tarred up 100
individual files, which you're just about to spew all over your desktop.
To test your tarball (once again assuming it was also zipped using gzip), use
the -t (or --list) option.
$ tar -zvtf moby.tar.gz
scott/scott 0 moby-dick/
scott/scott 102519 moby-dick/job.txt
scott/scott 0 moby-dick/bible/
scott/scott 207254 moby-dick/bible/genesis.txt
scott/scott 102519 moby-dick/bible/job.txt
scott/scott 1236574 moby-dick/moby-dick.txt
scott/scott 508925 moby-dick/paradise_lost.txt
In full, this listing tells you the permissions, ownership, file size, and time
for each file (the output above is trimmed to ownership and size).
In addition, because every line begins with moby-dick/, you can see that you're
going to end up with a directory that contains within it all the files and
subdirectories that accompany the tarball, which is a relief.
Be sure that the -f is the last option because after that you're going to
specify the name of the .tar.gz file. If you don't, tar complains:
$ tar -zvft moby.tar.gz
tar: You must specify one of the '-Acdtrux' options
Try 'tar --help' or 'tar --usage' for more information.
Now that you've ensured that your .tar.gz file isn't corrupted, it's time to
actually open it up, as you'll see in the following section.
Note - If you're testing a tarball that was compressed using bzip2, just use
this command instead:
$ tar -jvtf moby.tar.bz2
Untar and Uncompress Files
-zxvf
To create a .tar.gz file, you used a set of options: -zcvf. To untar and
uncompress the resulting file, you only make one substitution: -x (or
--extract) for -c (or --create).
$ ls -l
rsgranne rsgranne 846049 moby.tar.gz
$ tar -zxvf moby.tar.gz
moby-dick/
moby-dick/job.txt
moby-dick/bible/
moby-dick/bible/genesis.txt
moby-dick/bible/job.txt
moby-dick/moby-dick.txt
moby-dick/paradise_lost.txt
$ ls -l
rsgranne rsgranne 168 moby-dick
rsgranne rsgranne 846049 moby.tar.gz
Make sure you always test the file before you open it, as covered in the
previous section, "Test Files That Will Be Untarred and Uncompressed." That
means the order of commands you should run will look like this:
$ tar -zvtf moby.tar.gz
$ tar -zxvf moby.tar.gz
Note - If you're opening a tarball that was compressed using bzip2, just use
this command instead:
$ tar -jxvf moby.tar.bz2
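The test-then-extract routine is easy to wrap in a small function. What follows is only a sketch: safe_untar is a name invented for this example, and the damaged archive is simulated by cutting a good one in half. It relies on tar returning a nonzero exit status when it can't read the archive all the way through.

```shell
#!/bin/sh
# Sketch: list a gzipped tarball first and extract it only if the
# listing succeeds. All names below are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a sample tarball from a file of hex-dumped random bytes.
mkdir moby-dick
head -c 50000 /dev/urandom | od -An -tx1 > moby-dick/sample.txt
tar -zcf good.tar.gz moby-dick/

# Simulate a damaged download by keeping only the first half.
half=$(( $(wc -c < good.tar.gz) / 2 ))
head -c "$half" good.tar.gz > damaged.tar.gz

safe_untar() {
    # $1 = tarball to check, $2 = existing directory to extract into
    if tar -ztf "$1" > /dev/null 2>&1; then
        tar -zxf "$1" -C "$2"
    else
        echo "$1 failed the listing test; not extracting" >&2
        return 1
    fi
}

mkdir out-good out-damaged
safe_untar good.tar.gz out-good && echo "good.tar.gz extracted"
safe_untar damaged.tar.gz out-damaged || echo "damaged.tar.gz skipped"

cd - > /dev/null
```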
Conclusion
Back in the days of slow modems and tiny hard drives, archiving and compression
were necessities. These days, they're more of a convenience, but they're still
something you'll find yourself using all the time. For instance, if you ever
download source code to compile it, more than likely you'll find yourself
face-to-face with a file such as sourcecode.tar.gz. In the future, you'll
probably see more and more of those files ending with .tar.bz2. And if you
exchange files with Windows users, you're going to run into files that end with
.zip. Learn how to use your archival and compression tools because you're going
to be using them far more than you think.
About the Author:
Scott Granneman is a monthly columnist for SecurityFocus and Linux Magazine, as
well as a professional blogger on The Open Source Weblog. He is an adjunct
Professor at Washington University, St. Louis and at Webster University,
teaching a variety of courses about technology and the Internet.
"Linux Phrasebook" by Scott Granneman
ISBN: 0-672-32838-0
http://www.samspublishing.com/bookstore/product.asp?isbn=0672328380&rl=1
© Copyright Pearson Education. All rights reserved.
Chapter excerpt provided by Sams Publishing, an imprint of Pearson
Education
Reprinted with permission.
Links
Source URL: http://interactive.linuxjournal.com/article/9370
--- BBBS/NT v4.00 MP
* Origin: Omicron Theta (1:261/1)