Linux: ‘top’ shows almost all of my memory being used! Should I panic?

Top The short answer is, of course, “no.” ūüôā 1

A common mistake that Linux users make — indeed, one that even some less experienced Linux¬†sysadmins sometimes make — is treating Linux2 as if memory technology hadn’t evolved since MS-DOS, when you¬†needed¬†as much memory as possible — especially conventional RAM (anyone remember that?) — to be free, so that applications could be loaded into it and run. If there wasn’t enough free memory, then applications wouldn’t run, or would run worse, so you’d have to “optimize” your DOS system to have as much free conventional memory as possible (using tricks such as loading device drivers into upper memory, and, of course, exiting a program completely before starting another.)

Somehow, that line of thinking still exists — even, apparently, in people who never used MS-DOS or similar systems before(!). “Free memory is good,” people think. “If most of the RAM is used up, then there’s a problem,” they insist — one to be solved either by killing applications or by adding more memory to the system.

Linux — like all modern systems, and I’ll even include Windows here ūüôā 3, uses¬†virtual memory. Put simply (and, yes, I’m simplifying it a lot — there’s more to it, of course), it means that the amount of available memory is actually (mostly) independent of the physical system RAM, since the system uses the disk as memory, too. Since real RAM is (virtually) always faster — but (usually) smaller — than a disk drive, the system typically uses prefers to use it for more critical stuff (such as the kernel, device drivers, and currently actively¬†running code), with less critical stuff (say, a program waiting for something to happen) being relegated to the disk — even though it’s still “memory”. Note that even most of the aforementioned “critical stuff” can be temporarily moved to the disk if needed.

You may now be asking: if RAM is typically preferred for “critical stuff” only, and that stuff doesn’t take up the entire available memory, then what’s the rest of the RAM used for? The answer is, of course,¬†disk caching — both for reading (loading up recently and/or frequently accessed data from the disk and keeping it available in RAM for a while, in case it’s needed again in the future) and for¬†writing¬†(which works the other way around — the system tells the program the data is already written to disk, so that it can go on instead of waiting for it, but it’s still only in memory, to be written a short while later.) So, any free RAM is typically used for disk caching, with the system managing it automatically.

But people keep looking at the output of “top”, seeing 90% of the physical RAM used, and panicking. ūüôā

In fact, this confusion was so common that modern Linux systems even changed some labels on the “top” command. For instance, this is from a Red Hat Enterprise Linux 6.x:

"top" on RHEL 6.x
“top” on RHEL 6.x. Nice machine, eh? ūüôā

As seen above, the server has (rounding down) 567 GB of physical RAM, of which 506 GB¬†— i.e. 89% — are used. Looks scary (if you haven’t been reading so far), right? I mean, the server is at almost 90% capacity! But then you look at the value for “cached”, in the line below. That’s 479 GB¬†— or 84% — used for disk cache! In other words, the entire operating system and applications are running in just¬†5% of the server’s RAM, with the rest of it being used just to make disk access faster.

And that “cached” value — and this is the important bit —¬†counts as free memory. It’s available to the system whenever needs it. In “computer terms”, it’s not technically free, it’s being used for the cache, but in¬†“human terms”, it’s just as free as the one that actually says “free” in front of it.

As I said above, this caused so much confusion to users (for decades) that, eventually, it was changed. Here’s how it looks like in a RHEL 7.x system:

Top in RHEL 7.x
“top” in RHEL 7.x. Puny machine, I know.

In the first line: 79044 KB free + 240444 KB used + 697064 KB for buffers and cache (which were separated in the old version, by the way) = …¬†1016512 KB, which you’ll note is exactly the first value in the same line, for total physical memory.

In the second line, the first three values refer to swap, so they’re not relevant here, but the “avail Mem” one is new. What is it? According to the “top” manual page,

The avail number on line 2 is an estimation of physical memory available for starting new applications, without swapping. Unlike the free field, it attempts to account for readily reclaimable page cache and memory slabs.

In other words (and this is again an oversimplification), it’s what the system has¬†readily available — in other words, the “free” part, plus (most of) what is being used for caching. That doesn’t mean it can only use that much memory, it just means that that’s what it can use¬†right now, without swapping anything out. Which, in my opinion, ends up not mattering that much, in terms of a precise value. However, at least it now gives a better (though rough) idea of what part of the physical memory is¬†available, not simply “free” (i.e. not doing anything).

So, what about those specific cases mentioned (in a note) at the beginning? Well, if there is very little memory free and very little swap space free and very little memory (compared to the total amount of physical memory) is being used for buffers/cache, then and only then4 may it be the case that the server actually needs more memory and/or swap space, or that some process has some kind of memory leak or is otherwise using much more memory than usual (if restarting the service fixes it, then this is probably the case.)

Linux: What’s filling up my almost full filesystem?

Let’s say you realize (maybe because you got an alarm for it) that a particular filesystem — let’s say /qwerty — is full, or almost full, and you want to find out what’s taking up the most space. Simply enter something like:

du -k -x /qwerty | sort -n | tail -20

to get a list of the directories taking up the most space in that filesystem, sorted by size, with the largest ones at the bottom.

Note that the “/qwerty“, in this case, should be the filesystem’s top directory, not some subdirectory of it. In other words, it should be something that shows up on a “df” command.

Explanation:

du -k -x” shows the subdirectories (and their total sizes) of the specified directory (or of the current one, if you don’t specify one). “-k” means that sizes are reported in kilobytes (KB) — this is not mandatory, but different versions of “du” may use other units, and this one is easy to read. “-x” means “don’t go out of the specified filesystem”, and that’s quite important here, as you could have a “/qwerty/asdfgh” filesystem, mounted in a subdirectory of “/qwerty“, but since it’s a different filesystem (appearing separately in the results of a “df” command) you won’t want to include it in this list.

sort -n” is a simple numeric sort, and “tail -20” just means “show only the last 20 lines. 20 is a reasonable number that fits in most terminals at their default sizes (typically 80×25), so that you don’t have to scroll up.

(Yes, this is very basic, but, hey, one of the goals of this blog is to document stuff I help co-workers with, since, at least in theory, if someone has a question about something they need to do, then many other people may have the same doubt/need as well, and we have a couple of new team members whose backgrounds are not Linux/Unix-related, so…)

Networking definitions: Allowing traffic, Port forwarding, NAT, and Routing

I’m more of a¬†systems administrator than a¬†network admin (though I’ve also worked as the latter, in the past), but, of course, one can’t be a sysadmin without knowing at least¬†something about networking. And yet (and like my previous post about compiling stuff), I’ve found that it’s possible to do a perfectly good job as a sysadmin, and yet still mix up a few networking concepts from time to time.

Therefore, I wanted to write about four different, but related, concepts, that I’ve noticed people sometimes confuse.

1. Allowing traffic

This just means that a network interface allows traffic to a specific destination and/or port, possibly restricted to a specific source (an IP address, or a network). Any traffic that is not allowed is simply refused.

Note that this by itself doesn’t mean that the interface (of a server, a router, etc.) will do anything (or anything¬†desirable, at least) with the received traffic. For instance, it may not have something listening on that port, or it may not be configured to route that traffic to somewhere useful. This just means that it doesn’t instantly block the connection.

This is typically controlled by a local software firewall such as iptables.

2. Port forwarding

Port forwarding just means: if you receive a packet on interface A, port B, then redirect it to IP address C, port D. “D” may be the same as “B”, and “C” may be on the same host or at the other end of the planet.

Again, note that the mere fact that there’s a matching redirect rule for the initial destination interface and port doesn’t mean that the host actually accepts redirecting the traffic (see 1.) or that it knows how to route it to its final destination (see 4.)

3. NAT

NAT, or Network Address Translation, can be seen as a special case of two-way port forwarding (see 2.). Basically, a router (which may well be a simple server with two or more (physical or virtual) network interfaces, it doesn’t have to be a “router” bought in a store) accepts traffic from a (typically private) network, then¬†translates it so that it goes to the destination address in the (typically public) network, with (and this is the important part) the source “masked” as the router’s public IP address, and a different source port that the router “remembers”, and then knows how to handle the returning traffic and “untranslate” it so that it goes to the original source.

In short, NAT allows many private hosts in a network to access the Internet using a single public IP, and a single connection. (It can have other similar uses, even some not related to the public Internet, of course; this is just the most common one.)

4. Routing

Routing is simply a host knowing that a packet intended for IP address X should be directed to IP address Y 1.

Again, the obligatory caveats: this doesn’t mean that the traffic is even accepted before attempting to route it (see 1.), or that the host actually knows how to reach IP address Y itself (it may not have a local route to it). It may also be that the routing is correct, but there is no address translation (see 3.), so the destination host receives a packet in a public interface that claims to be from a private address, and refuses it. Finally, the destination host itself needs a route to the original source, and it may not have one configured (or have a¬†misconfigured one, causing asynchronous routing and possibly making the source refuse the returning traffic).

Thoughts? Corrections? Clarifications? Yes, I know this is relatively basic stuff. ūüôā

Linux: How to Compile Stuff

A couple of days ago I was surprised when someone who’s worked as a Linux sysadmin for several years asked me how to compile a program that was only available in source format (it was a cryptocurrency miner, for the curious), since he’d never compiled anything in his life. How was that possible? But then I realized: compiling stuff on Linux hasn’t really been a necessity 1 for the last, oh, 15-20 years or so. Back in 1993 or so (yikes!), when I started using Linux, things were of course quite different.

So, without further ado: how to compile stuff on Linux. Assuming you’ve already downloaded and uncompressed a program’s source, here’s what works 99% of the time (from the program’s directory, of course):

./configure --help  # to see available options

./configure # followed by the desired options, if any

make && make install

By default, compiled programs are then installed to /usr/local (e.g. binaries go to /usr/local/bin, libraries to /usr/local/lib , etc.). If you want to change that, just add –prefix=path (e.g. –prefix=/usr/local/test) to the ./configure options. (In that example, it’ll install binaries to /usr/local/test/bin, and so on. There are ways (i.e. different ./configure options) to install to different sub-paths, but those are beyond this basic tutorial. The default /usr/local/bin, lib, etc. paths also have the advantage of being in the default executable and library paths in typical distributions, so that you don’t have to add to those.)

If the above commands worked for you, congratulations, you’ve compiled your first program, which should now be ready to run! ūüôā However, if you’ve never compiled anything before on that system, it’s more likely that the ./configure command complained about missing libraries, missing utilities, or even a missing compiler; after all, as the first paragraph showed, it’s possible to use Linux for years, even work with it for a living, without needing to compile anything, which in turn has caused most modern distributions to¬†not install compilers and development stuff by default.

How to do so? On Debian/Ubuntu, start with:

apt-get install build-essential

And on Red Hat/CentOS/Fedora, use:

yum groupinstall¬†“Development Tools”

The above should install the C/C++ compiler and common development tools. However, a particular program may ask for other libraries’¬†development files, which may not be installed even though the library itself is. It’s impossible to give you an exhaustive list, but typically you can use yum search or apt-cache search to find what you need. Debian-based development packages typically end in -dev, while Red Hat-based ones end in -devel . For instance, suppose you’re on Ubuntu and ./configure complains about not having the zlib development files. A quick search would point you (at least on Ubuntu 17.04) to¬†zlib1g-dev, which you’d install with apt-get install¬†zlib1g-dev .

And what if there is no configure script (unlikely these days, but possible if it’s a very old source)? If there is a file named Makefile, then try typing make, and see if it compiles. Otherwise, your best bet is to read any provided documentation (e.g. a README or INSTALL file).

Any questions, feel free to ask.

Linux: find users with full sudo access on many machines

Disclaimer: there are surely many, far better ways to do this — feel free to add them in the comments. This was just a quick and dirty script I came up with yesterday,¬†after a co-worker wondered if there was an easy way to do this on all the servers we administer.

The situation: you administer 1000 or more servers, you and your team are the only users who are supposed to be able to sudo to root (unlike simply running certain specific commands, which is typically OK), but sometimes you have to grant temporary full sudo access to a particular user or group of users who, for instance, are doing the initial application installations, but who are supposed to lose that access when the server enters production.

The problem: it’s easy to forget about those, and so the temporary access becomes permanent (yes, there are other ways around¬†that, such as using a specific syntax for those accesses that includes a comment that you then use a script, called by the “at” daemon, to remove later, but bear with me for now). Wouldn’t it be useful to be able to look at a group of some, or even all, of the servers you administer, and find those unwanted, forgotten sudo accesses?

Continue reading “Linux: find users with full sudo access on many machines”