Life of a Software Package

From SlackWiki
Revision as of 02:30, 3 June 2009 by Erik (talk | contribs) (Copy from old)

From the programmer's keyboard to your (Slackware) Linux system.

Introduction

This article explains how a program someone writes on one side of the world ends up installed and managed on your system. It's meant to be easy to understand for a novice user coming from Windows, and only requires some basic knowledge of Unix systems. Specifically, the reader should know:

  • How basic Unix permissions work.
  • How to interpret the basic output of the ls command.
  • How a command line interface works.

It only contains general ideas that could help a novice user understand how installing software differs between Windows and Unix, and gives no specific instructions on how to do it. The distribution manual will give you the specific details you need, and may be a good read after you have read this article.

From source code to machine language

Note: You do not need to run any of the commands in this section. It's enough to understand the text and part of the output.

When a programmer creates a program, it's very common to write it in a so-called high level language. In other words, the programmer doesn't create the executable directly by specifying the instructions for the computer to run. Instead, the program is written in a language that represents its structure and logic, and that logic is then translated to a language the bare machine can understand. Let's suppose someone wants to write a Hello, World program in the C programming language. This is a program that prints a message to the screen and finishes. It is used to check that your system can translate the language to machine language and that the basics work, and to give the novice programmer an idea of how the high level language works. This would be a very simple Hello, World program written in C:

#include <stdio.h>

int main()
{
	printf("Hello, World!\n");
	return 0;
}

You do not need to understand that program. The above text is what is commonly known as the source code of the program. The source code of a C program is usually spread over one or more source files, so this program could very well be stored in a plain text file called hello.c. You could view this file using a plain text editor like Notepad on Windows.

This file cannot be executed directly because it's not a real program. First off, it doesn't have execution permissions, and if we tried to give it execution permissions and run it, we would get an error message:

$ ls -l hello.c
-rw------- 1 rg3 users 74 2007-10-15 19:27 hello.c
$ chmod +x hello.c
$ ./hello.c
./hello.c: line 3: syntax error near unexpected token `('
./hello.c: line 3: `int main()'

We need to use a program called a compiler to get a binary, executable file: a file in a specific format that the operating system (Linux in our case) can understand. When you run it, the operating system reads the file, copies the different program components to the computer's memory and starts the program execution. If our machine has everything ready to compile C programs and we want to use the GNU C Compiler (gcc), we could do something as simple as:

$ gcc -o hello hello.c

And we would get a file named hello which is our program ready to be run. As you see, our program is simply called hello and not hello.exe, which would be a common name if we were working under Windows. In Unix systems, the convention is that programs do not have any file extension in their name (like EXE). We could then run the program and see that it does what we wanted.

$ ./hello
Hello, World!
$

Complex programs use libraries

Most programs do a more sophisticated task than printing a message on the screen and finishing. The source code of the program above has 7 lines. It is not uncommon for a simple program to have thousands of lines, and there are a good number of complex programs out there that have millions of lines of source code. It is also very common practice to build your program using components called libraries (usually shared libraries). Libraries are files that contain the machine code to perform several different tasks.

For example, let's suppose you were going to create a program that needs to download some data from the Internet, via HTTP (the web), FTP or another network service. Let's also suppose you don't know much about writing programs that talk to others over the network, or that your program focuses on solving some other problem and you don't want to spend time writing a lot of code just so it can download a file. Fortunately for you, there is a library called libcurl that makes retrieving files over the network very easy. The library contains all the code you need, so the source code you write will not contain any of the low-level networking details. You simply indicate that you want to use libcurl and call the library functions every time you want to download a file. Do you need to learn to sail, fly a plane or speak a different language to send a letter to a friend on a different continent? No, you put the letter in the mailbox and the postal service does it for you. Libraries work like this.

The moment you decide to do this, your program starts depending on libcurl. Some pieces of the library need to be present on your system when you compile the program to create an executable, and some other pieces need to be present when you run the program. Otherwise, the program will not compile or will not be able to run. Libraries are convenient because, if managed well, you can install them on your system once and they will be used by every program that needs them. This is why libraries are regarded as a good idea in the programming world, in most cases.

In Windows, shared libraries are called DLLs, as the library files usually have the DLL extension. In Unix, they commonly have the .so suffix, or a longer one that contains it. For example, I have libcurl on my system, and the shared library is located in the file /usr/lib/libcurl.so.4.0.0.
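As a small illustration of programs depending on shared libraries, the sketch below asks the ldd tool which libraries an existing program needs at run time. It assumes a Linux system where ldd is available (it normally ships with the C library) and uses the ls command only as a convenient example. The paths and version numbers printed will differ from system to system.

```shell
# Find the full path of the ls program (used here only as an example binary).
program=$(command -v ls)

# Ask which shared libraries it needs at run time.
# "|| true" keeps the script going if ldd is missing or the binary is static.
libs=$(ldd "$program" 2>&1 || true)

echo "Shared libraries needed by $program:"
echo "$libs"
```

Each line of the output typically names one .so file, much like the libcurl file mentioned above.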

Libraries and programs all over the place need a package manager

So your system is going to be populated by a lot of programs, many of them using many different libraries for different tasks, some of them having some libraries in common, others having nothing in common. As you can guess, this situation can evolve into a pretty chaotic system. Let's describe how Windows did this in the past, and how Unix systems have been trying to handle the situation for many years now.

In Windows, most people distribute programs already compiled. You get a group of files or a single file that holds your program already prepared to be run. You extract those contents and place them somewhere on your hard drive, usually all of them under a specific directory (folder). You could then create some shortcuts in the start menu and the program is ready to be run. An installer program usually does all of this for you, asking some questions.

What happens when the program uses a library? If the library cannot be assumed to exist in a standard Windows installation, the common practice is to include a copy of the library with the program. If it's a relatively uncommon library, the installer usually puts it in the same place as the program. When the program is run and requests the library, the system first looks in the folder holding the program, finds the library there and starts to use it. If it's a common library but you need a specific version of it, the installer may try to install it in a common place so all the programs can use it. On a system where you had installed a lot of software over time, it was very typical for the installer to ask you "I am trying to install the following library, but it appears to be present in your system in a newer version. Do you want me to replace the copy of it with my copy or do I leave it as it is?". And, along the same lines, when you removed the program it would say "I was going to remove this library from the common place, but other programs may want to use it. Can I remove it or should I keep it there?". This chaos was called "DLL Hell".

Unix tries to avoid this problem in several ways, and its solutions create the need for a package manager, as we will explain. First off, in Unix the files on your hard drive are not grouped by program, but by their function. All binaries are stored in two or three folders, and the same goes for libraries and help documents. If all the documentation and help for the different programs is installed in a common location, it's easier to create a help system from which you can browse the documentation of every program installed on the computer, for example. This is generally considered a good idea and it's the tradition, but of course the idea has its detractors.

A second difference is that in Unix programs are distributed alone. If a program needs a library to run, it needs to specify that somewhere, but only under special circumstances is it recommended to include the library as part of the program. In most cases, the library is distributed separately.

Those two differences avoid the DLL hell. By installing libraries, programs, documentation and other data to common system locations, you avoid duplicating data. If a security problem is found in a library, every program using it could be compromised and make the system vulnerable. In that case you update the library once and all the programs that use it are automatically protected, as each program doesn't include its own copy in its directory. By separating the programs from the libraries they use and distributing them separately, you make sure programs do not overwrite the libraries used by others or remove common libraries when they are removed.

However, the solution itself brings some new problems. For example, if a program installs files all over common system directories and I later want to remove it, how do I know which files need to be removed? And if a program requires a library to run, where should that fact be recorded? Package managers are the answer. Under Unix, software is often distributed as packages. Packages are groups of files that contain programs, libraries, documentation or simply data. Under Windows, to install a program you often download an installer file, run it and the program is installed. This installer file, which holds all the files the program needs and extracts them to the proper location, could be considered a form of package, so you get the idea. Packages in Unix are usually managed by a package manager. A package manager is a program that allows you to install packages, check the list of installed packages, remove packages and perform many more complex tasks, depending on how powerful and feature-rich the package manager itself is.

When you install a program using a package, the package typically holds the program binary, the program data and the program documentation, along with information on where those files should be installed, which can be all over the system. Fortunately, at some point during the installation, the package manager records the name and version of the package and the list of files it installed. This is the trick that allows you to later remove the package using the package manager without having to remember which files had been installed where. In addition, the package may hold information about other packages it needs installed for it to run, and the package manager may use this information to automatically download and install them too. Hopefully, you now start to understand the practical vision behind packages and package managers.
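The bookkeeping described above can be sketched with a toy package manager written as a shell script. This is only an illustration of the idea, not how any real package manager is implemented. The function names install_pkg and remove_pkg, the package name hello-1.0 and the directory layout are all made up for the example, and everything happens inside a temporary directory so the real system is never touched.

```shell
set -eu

work=$(mktemp -d)                 # scratch area for the whole experiment
trap 'rm -rf "$work"' EXIT        # clean up when the script ends
root="$work/root"                 # stands in for the system root directory "/"
db="$work/db"                     # stands in for the package manager's records
mkdir -p "$root" "$db"

# Build a tiny package: a tarball whose layout mirrors where files should land.
mkdir -p "$work/build/usr/bin" "$work/build/usr/doc/hello-1.0"
printf '#!/bin/sh\necho "Hello, World!"\n' > "$work/build/usr/bin/hello"
echo "Prints a greeting." > "$work/build/usr/doc/hello-1.0/README"
(cd "$work/build" && tar -czf "$work/hello-1.0.tgz" usr)

# install_pkg FILE: record the list of files first, then extract them.
install_pkg() {
    name=$(basename "$1" .tgz)
    tar -tzf "$1" | grep -v '/$' > "$db/$name"   # keep files, skip directories
    tar -xzf "$1" -C "$root"
}

# remove_pkg NAME: delete exactly the files recorded at install time.
remove_pkg() {
    while IFS= read -r file; do
        rm -f "$root/$file"
    done < "$db/$1"
    rm -f "$db/$1"
}

install_pkg "$work/hello-1.0.tgz"
echo "Installed packages: $(ls "$db")"
recorded=$(cat "$db/hello-1.0")
echo "$recorded"                                 # the recorded file list
greeting=$(sh "$root/usr/bin/hello")             # the installed program runs
echo "$greeting"

remove_pkg hello-1.0                             # uses the record, not memory
```

The toy leaves empty directories behind and knows nothing about dependencies or versions. Real package managers handle those details, but the central idea is the same: the list written at install time is what makes a clean removal possible.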

Too many package managers

The problem with package managers is that different Unix systems, and even different Linux distributions, use different package managers. They use different package formats, and one package manager generally cannot handle packages made for another. Slackware uses pkgtools, Debian and Ubuntu use apt-get, Red Hat uses yum, Mandriva uses urpmi, Arch uses pacman, etc.

Suppose you are a programmer and created the Hello, World program we saw at the beginning. How do you distribute your program? You have several options. If you don't want people to get the source code of your program, you need to distribute the program already compiled and probably packaged. To do that, you could provide your own installer that installs and uninstalls the program cleanly on any system, bending some of the usual rules to achieve maximum compatibility, so the program will run on many different systems and distributions. Many commercial games are shipped this way. Unreal Tournament 2004 for Linux is distributed this way, for example. You could also provide a package for each of a subset of supported systems. Many companies do this. They give you Debian, Red Hat, Suse and Mandriva packages to choose from, for example, each in the proper format to be used with the package manager from that distribution. If you want to use the software under another system you are out of luck. You can try some tricks but they are not guaranteed to work.

If, on the other hand, your program is open source and you don't mind people reading the source code of your program, the common case is to avoid creating a package for anything. You simply distribute a tarball (similar to a ZIP or RAR file) containing the source code and instructions to compile it. If someone wants a Debian package to install it, someone will have to compile your program under Debian and make a package with the result. This is a very common case. In fact, distributions like Ubuntu or Debian rely heavily on package repositories: network locations from which you can download thousands of packages for your system, created by a myriad of official and unofficial packagers (people that create packages for the system). For example, if you want to install a program under Ubuntu, it's rare for the program not to be already packaged in a repository, and you can download and install it, together with its dependencies, in a couple of mouse clicks.

Summing up

  • Programmers often write programs as source code that must be compiled.
  • They use libraries to make writing programs easier.
  • Distributing the resulting program is easier using a package manager.
  • Many programmers only give you the source code due to the diversity of package managers and systems.
  • Someone else is responsible for creating a package for a specific system.

Slackware specifics

Slackware is, as you may know, a very simple system. Being simple doesn't mean it's simple to use. On the contrary, a system with a simple design and simple tools usually requires the user to do more things to achieve a goal. The advantage of a simple design is that it's easier to understand if you want to know how your system works, and sometimes it's also more stable and has fewer bugs. As part of its simple design, the package manager in Slackware is also very simple, and so are its packages. Slackware packages are tarballs (again, something like ZIP files) that, if extracted in the right place, will populate the system with the package files. They also hold some special files with information about the package itself. As they are simple tarballs, Slackware doesn't try to hide this fact, and Slackware packages have the tgz extension (short for tar.gz), contrary to other systems in which packages have a special extension to make it clear that they are packages, like rpm or deb.
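To make the "simple tarball" idea concrete, the sketch below builds a mock-up with the same general layout as a Slackware package and lists its contents with plain tar. The package name, its contents and the description text are invented for the example. The install/slack-desc file is where Slackware packages keep their description, and real packages are normally created with the makepkg tool rather than by calling tar directly, so treat this as something to look at, not something to install.

```shell
set -eu

work=$(mktemp -d)                 # scratch area, removed when the script ends
trap 'rm -rf "$work"' EXIT

# Files laid out as they should appear under "/", plus an install/ directory
# holding information about the package itself.
mkdir -p "$work/pkg/usr/bin" "$work/pkg/install"
printf '#!/bin/sh\necho "Hello, World!"\n' > "$work/pkg/usr/bin/hello"
echo "hello: hello (prints a greeting)" > "$work/pkg/install/slack-desc"

# Pack everything as a gzip-compressed tarball with the tgz extension.
# The file name follows the usual name-version-architecture-build pattern.
(cd "$work/pkg" && tar -czf "$work/hello-1.0-noarch-1.tgz" usr install)

# Listing the archive shows that it is an ordinary tarball.
listing=$(tar -tzf "$work/hello-1.0-noarch-1.tgz")
echo "$listing"
```

The listing shows paths such as usr/bin/hello, which is exactly where the file would land if the tarball were extracted in the root directory.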

This is not a problem, but sometimes this confuses novice users. They go to the program webpage and download the program source code in a tarball (usually a file with the tar.gz extension) and think "Hey, if Slackware packages are tarballs and this is a tarball, I'm going to install this file with the package manager". Wrong! Even when the package manager complains that the package name does not end in tgz but in tar.gz, they often rename the file and try again. Those are two mistakes in a row. The package manager will try to extract the tarball contents to the right place mentioned earlier and nothing useful will happen, as the files inside the tarball are not structured as they need to be, but this is the small problem. The big one is that what you are trying to install is the source code of the program, and not the program itself! Remember, you need to compile it first in the majority of cases.
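One way to avoid the mistake is to look inside a tarball before doing anything with it. The sketch below builds a mock source tarball and applies a rough heuristic: a Slackware package normally has an install/ directory at the top level of the archive, while a source tarball usually keeps everything under a single project directory. The function name kind_of_tarball and the file names are made up for the example, and the heuristic is an assumption for illustration, not an official test.

```shell
set -eu

work=$(mktemp -d)                 # scratch area, removed when the script ends
trap 'rm -rf "$work"' EXIT

# A mock source tarball: everything sits under one project directory.
mkdir -p "$work/hello-1.0"
echo "Compile hello.c with a C compiler." > "$work/hello-1.0/README"
echo 'int main() { return 0; }' > "$work/hello-1.0/hello.c"
(cd "$work" && tar -czf hello-1.0.tar.gz hello-1.0)

# kind_of_tarball FILE: guess what the archive is from its top-level layout.
kind_of_tarball() {
    if tar -tzf "$1" | grep -Eq '^(\./)?install/'; then
        echo "package"
    else
        echo "probably source code"
    fi
}

tar -tzf "$work/hello-1.0.tar.gz"             # shows hello-1.0/hello.c and so on
verdict=$(kind_of_tarball "$work/hello-1.0.tar.gz")
echo "hello-1.0.tar.gz: $verdict"
```

Seeing file names ending in .c, or a README with compilation instructions, is a strong hint that the tarball holds source code that must be compiled first.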

Under Slackware, you should first check if there is an official package for the program. If there is not, you could try to download a ready to use package from a place or someone you trust. Otherwise, you could compile the program yourself, create a package for it, and then install the package. The compilation and package creation can sometimes be automated for ease of use, for example using SlackBuild scripts.