Using find -exec or xargs to process pathnames with other commands

Ian! D. Allen – www.idallen.com

Fall 2013 - September to December 2013 - Updated 2019-01-06 04:26 EST

1 Using the pathnames found by find


This is optional material for CST8207


The Problem:

The find command is showing me pathnames. I could use the mouse to copy-and-paste these pathnames into many cp commands, but surely there must be a way to automate this? Can the cp command select file names the same way that find can?

The idea of Unix/Linux is that every command does one thing well, so they don’t put features of find into cp. You use find to generate the names and you use cp to copy the names. The trick is getting the names generated by find to be used by cp.

For an introductory assignment, I don’t expect more knowledge than copy and paste using your mouse, but that’s not how a real sysadmin would do it. Here are some optional hints on how a real sysadmin would get the pathnames copied without using a mouse or copy-and-paste.

1.1 Method One – find -exec

The designers of the find command built in a mechanism to run a command on the pathnames that find finds: the -exec option. Read man find to see how -exec works. The EXAMPLES section of that man page includes (among many other uses of find) an example that you can use to run file on a whole bunch of files:

 find . -type f -exec file '{}' \;

You can append the above -exec and following arguments to any already-working find command you have, replacing the . starting point and -type f expression in the example with your own starting point and expression to find the pathnames you want. The find command line with the above added -exec expression will then run file on each of the pathnames found by find, one at a time.

The find command will run the -exec command once per pathname. The pathname generated by find is inserted into the -exec command line where that quoted set of braces is. You might be able to see it better if you insert an echo in front of the command line being run by find, to echo on your screen the command that is being built and executed:

 find . -type f -exec echo file '{}' \;

(Make sure you get this simple -exec echo file example working on your own set of pathnames before you try to modify it to do something more complicated such as a file copy.)

But of course you don’t want to simply run file on each pathname; you want to copy each pathname into a single destination directory. I’ll leave most of this as an “exercise for the student”, with the following hint:

The above is just one way to automate the copy by having find do the work for you. It has the disadvantage that it runs a separate cp command for every pathname find finds, which is no problem if there are only three pathnames but is a huge problem if there are a million pathnames because find will have to run cp a million times (and that takes time).
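To see the one-command-per-pathname behaviour with a copy instead of file, here is a minimal sketch. It is only one possible shape for the command, not necessarily the one intended for the exercise. The scratch directory and the file names are made up for the demonstration.

```shell
# Scratch tree with made-up names, one of them containing a blank.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dest"
touch "$demo/src/one.txt" "$demo/src/two words.txt"

# Dry run: echo shows each cp command line, one line per pathname found.
find "$demo/src" -type f -exec echo cp '{}' "$demo/dest" \;

# Real run: without the echo, cp is executed once for every pathname.
find "$demo/src" -type f -exec cp '{}' "$demo/dest" \;

copied=$(ls "$demo/dest" | wc -l)
echo "Files copied: $copied"
rm -rf "$demo"
```

Because find hands each pathname to cp as a single argument, the blank in the second name causes no trouble here.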

Modern versions of find have a modified -exec statement ending in + instead of ; that can pack multiple file names into the same command execution, reducing the number of times the command has to be executed by increasing the number of pathnames passed to each execution:

 find . -type f -exec file '{}' +
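You can count the difference yourself. The sketch below assumes a find that supports the + form. The scratch directory and the five empty files are invented for the test, and the small sh -c command is just a stand-in that prints one line each time it is run.

```shell
# Five made-up files in a scratch directory.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c" "$demo/d" "$demo/e"

# Each run of the inner sh prints one line, so counting lines counts runs.
# Ending with \; makes find start the command once for every pathname.
runs_semicolon=$(find "$demo" -type f -exec sh -c 'echo run' sh '{}' \; | wc -l)

# Ending with + lets find pack the pathnames into as few runs as it can.
runs_plus=$(find "$demo" -type f -exec sh -c 'echo run' sh '{}' + | wc -l)

echo "Runs with the semicolon form: $runs_semicolon"
echo "Runs with the plus form:      $runs_plus"
rm -rf "$demo"
```

With only five short pathnames everything fits in one command line, so the + form should need a single run where the \; form needs five.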

This works similarly to xargs, which is described next:

1.2 Method Two – xargs

If you have a million files to copy, using find with the traditional version of -exec is not the way to do it, since you will have to call and run the cp command program once per pathname, and that means running cp a million times. Even if cp did nothing, it would take a long time to re-execute cp a million times. We can do this more efficiently.

The cp command is designed to allow multiple source pathnames if they are all being copied into the same destination directory. We could reduce the number of cp commands run if we could put multiple source pathnames into each cp command line. If we could fit a million source pathnames on one cp command line, we would only need one single cp command to do the work. This is a huge savings compared to running cp a million times.
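As a small illustration of the multiple-source form of cp, here is a sketch in a scratch directory. All of the names are made up for the demonstration.

```shell
# Scratch directory so nothing real is touched.
demo=$(mktemp -d)
mkdir "$demo/dest"
touch "$demo/a.txt" "$demo/b.txt" "$demo/c.txt"

# One cp command: three source pathnames, then the destination directory last.
cp "$demo/a.txt" "$demo/b.txt" "$demo/c.txt" "$demo/dest"

copied=$(ls "$demo/dest" | wc -l)
echo "Files copied by a single cp: $copied"
rm -rf "$demo"
```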

Alas, most Unix systems have a limit on the total length of a command line. You can’t fit a million pathnames on one single cp command line. This is why the xargs program was written.
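If you are curious how big the limit is on your system, you can usually ask for it. This sketch assumes the getconf utility is installed. The number varies from system to system, and the limit covers the environment variables as well as the command line arguments.

```shell
# Ask for the maximum combined size, in bytes, of the arguments and
# environment that can be passed to a new program.
arg_max=$(getconf ARG_MAX)
echo "Command line limit: $arg_max bytes"
```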

The xargs program reads a (usually large) list of pathnames from standard input. It will read those pathnames and pack a command line with as many of those pathnames as can possibly fit, then call the command, then repeat with another large number of pathnames, and repeat again until all the pathnames are processed. By packing each command line as full of pathnames as it possibly can, it uses the minimum number of commands needed to get the job done.
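The batching is easier to see on a small scale. In this sketch, ten made-up names stand in for a huge list, and the -n 4 option artificially limits xargs to four arguments per command. Without -n, xargs would pack far more names into each command line.

```shell
# Ten made-up names, one per line, arrive on standard input.
# -n 4 forces at most four arguments per command so the batches are visible.
batches=$(printf '%s\n' a b c d e f g h i j | xargs -n 4 echo)
echo "$batches"
```

Each line of output is one run of echo: two runs with four names each, then a final run with the two names left over.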

See man xargs and look at the EXAMPLES section for examples using find to generate pathnames that get sent into xargs. Sysadmins always use the -print0 option to find and the -0 option to xargs so that blanks, quotes, and newlines in pathnames don’t cause problems. (See the man pages.)
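Here is a sketch of what goes wrong without those options, assuming a find and xargs that support -print0 and -0. The scratch directory and its two file names are made up, and the small sh -c command simply prints how many arguments it was given.

```shell
# Two made-up files, one with a blank in its name.
demo=$(mktemp -d)
touch "$demo/plain.txt" "$demo/two words.txt"

# Newline-separated names: xargs also splits on the blank, so the two
# pathnames arrive at the command as three arguments.
unsafe=$(cd "$demo" && find . -type f | xargs sh -c 'echo "$#"' sh)

# NUL-separated names: each pathname arrives intact as one argument.
safe=$(cd "$demo" && find . -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "Arguments without -print0 and -0: $unsafe"
echo "Arguments with -print0 and -0:    $safe"
rm -rf "$demo"
```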

Since xargs can only add lists of pathnames to the end of a command line (where most commands expect them), this poses a problem for a file copy that expects all the source filenames to precede the destination directory name. The maintainers of GNU cp added the -t option (long form --target-directory) so that you could specify the destination directory first on the command line, allowing all the source pathnames to be stacked at the end just the way xargs generates them:

$ cp -t /tmp file1 file2 file3                    # file4 file5 etc...

You need to use the -t option when you use cp inside xargs so that the list of source pathnames can appear at the end of the command line.

Again, insert echo at the start of your xargs command lines (and start with only a few pathnames on standard input, not hundreds) until you see echoing on your screen the command lines you know will work. Then take out the echo and feed the full list of pathnames.
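Putting the pieces together might look like the sketch below. It assumes GNU cp (for the -t option) and a find and xargs that support -print0 and -0. The scratch directory and file names are made up for the demonstration.

```shell
# Scratch tree with made-up names, one of them containing a blank.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dest"
touch "$demo/src/one.txt" "$demo/src/two words.txt"

# Dry run: echo shows the single cp command line that xargs builds.
find "$demo/src" -type f -print0 | xargs -0 echo cp -t "$demo/dest"

# Real run: without the echo, cp copies every pathname in the batch.
find "$demo/src" -type f -print0 | xargs -0 cp -t "$demo/dest"

copied=$(ls "$demo/dest" | wc -l)
echo "Files copied: $copied"
rm -rf "$demo"
```

Note that the echo output cannot show where one argument ends and the next begins, so a pathname containing a blank looks like two words in the dry run even though cp receives it as one argument.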

As described in the previous section, modern versions of find can do this same packing themselves: ending the -exec statement in + instead of ; passes multiple pathnames to each execution of the command.

1.3 Method Three – Shell Command Substitution: $(command)

The shells have a command substitution feature that lets you take the standard output of any command and insert it into a command line. (See the heading Command Substitution in man bash, and also previous class notes such as CST8207 Command Substitution or CST8129 Command Substitution.)

You might think of using this handy feature to take the standard output of find (a list of pathnames) and insert it into a cp command line. This command substitution might work, but it has serious limitations:

  1. None of the pathnames can contain any blanks, asterisks, or other shell meta-characters that the shell will expand. This may be true for the pathnames you substitute today, but it won’t always be true!
  2. The total list of pathnames can’t exceed the system limit on the length of a command line. The list might fit today, but it won’t always fit!

In other words, command substitution only works sometimes, where the other two solutions presented earlier work every time (provided, in the xargs case, that you use -print0 in your find command and -0 in xargs!).

Since sysadmins want solutions that always work and won’t mysteriously start failing in the future, avoid using command substitution to naïvely generate pathnames needed by other commands if those pathnames might ever contain blanks or other shell meta-characters, or if the list of pathnames might be very large. The embedded blanks and shell meta-characters in the pathnames, or the sheer number of pathnames, will some day cause errors if you rely on command substitution.
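A quick sketch of the failure, assuming the shell's default word splitting (IFS unchanged). The scratch directory and its two file names are made up, and set -- is used only as a convenient way to count the words the shell produces.

```shell
# Two made-up files, one with a blank in its name.
demo=$(mktemp -d)
touch "$demo/plain.txt" "$demo/two words.txt"

# Unquoted command substitution: the shell splits the output of find on
# blanks as well as newlines, so two pathnames become three words.
words=$(cd "$demo" && set -- $(find . -type f) && echo "$#")

echo "Pathnames on disk: 2"
echo "Words the command would receive: $words"
rm -rf "$demo"
```

A cp command built this way would be handed two pathnames that don't exist (the two halves of the name with the blank) in place of the one that does.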

(With correct use of shell options to turn off file GLOBbing and suppress the splitting of words on blanks, you can almost safely write a shell script that does use command substitution and pathnames, but it isn’t pretty, doesn’t work for file names with newlines in them, and the options used are unsuitable for interactive shell use. It can still stop working if the list of pathnames is longer than is allowed on a command line. Don’t do it!)

Author: 
| Ian! D. Allen, BA, MMath  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
