RE's and grep

Tue Sep 2 14:11:50 EDT 2003

On Tue, 2 Sep 2003 karina.popkova at verizon.net wrote:

> I am interested in finding words in a file
> like a dictionary (file.dict) that have a
> basic match to a pattern.
> I just read this, so I will try this.
> >>% egrep "^\(fee\|fie\)" junk.txt
>
> Now, what I really want to do
> is to search for general words
> that have a pattern, but also to
> exclude a specific vowel or consonant.

This is what pipes & filters work best at!

The trick for such questions is usually to grep once for the broader
pattern, and then pipe that to another grep to narrow down the results to
just the ones you're looking for.

> I.E., to make as a part of the search pattern
> that you call an RE, to look for given words
> and make sure that the vowel "a" or the
> vowel "e" is not a part of the word or string?

Here's one way to do it:

  egrep \
    '(pattern|another pattern|a third pattern)' \
    /usr/share/dict/words  \
  | grep -vi '[ae]'

This looks for "pattern", "another pattern", or "a third pattern" in the
word list file /usr/share/dict/words, then removes all lines in that list
that have either of the letters 'a' or 'e' (or 'A' or 'E', because the -i
flag I passed to the second grep makes the match case Insensitive).

Note that, because the word 'pattern' has both an 'a' and an 'e', this
particular example will never match anything, but you get the idea :)
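
For a version that actually prints something, here is a minimal sketch of
the same two-stage filter. The word list is a made-up stand-in for
/usr/share/dict/words, the patterns "oon" and "lon" are only
placeholders, and grep -E is used as the standard spelling of egrep.

```shell
# Hypothetical sample word list, standing in for /usr/share/dict/words.
words=$(mktemp)
printf '%s\n' moon spoon balloon pylon Elon teaspoon > "$words"

# Stage 1: keep lines containing "oon" or "lon".
# Stage 2: drop any survivor containing a or e (or A or E, because of -i).
result=$(grep -E '(oon|lon)' "$words" | grep -vi '[ae]')
printf '%s\n' "$result"

rm -f "$words"
```

Every word passes the first grep. The second grep then drops "balloon"
and "teaspoon" for their lowercase vowels, and drops "Elon" only because
-i makes the capital E count too.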

> How could you find all words in a file that do not
> have the letter: a, or e or i, and so on (???)

Here's one way to do it:

  grep -v '[aei]' /path/to/file

Note: this matches *lines* that have none of the bracketed characters. If
you actually want to match individual words, and it can't be assumed that
the file has one word per line, then this has to be accounted for.

Here's a way to handle that, using `fmt` to "flatten" the file:

  fmt -1 /path/to/file | grep -v '[aei]'
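
To see the line-versus-word difference on something small, here is a
rough sketch. The sample file is made up, and tr is swapped in for
fmt -1 as the splitter on the assumption that words are separated by
single spaces; on systems that have fmt, fmt -1 should give the same
one-word-per-line list for this input.

```shell
# Hypothetical sample file with several words per line.
sample=$(mktemp)
printf '%s\n' 'jump for joy' 'the quick fox' 'go slow' > "$sample"

# Line-based: only whole lines with no a, e, or i survive.
by_line=$(grep -v '[aei]' "$sample")

# Word-based: turn spaces into newlines first (stand-in for fmt -1),
# then filter each word on its own.
by_word=$(tr -s ' ' '\n' < "$sample" | grep -v '[aei]')

printf 'lines:\n%s\nwords:\n%s\n' "$by_line" "$by_word"
rm -f "$sample"
```

The line-based run throws away all of "the quick fox", including "fox",
while the word-based run keeps "fox" and drops only "the" and "quick".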

> Or do not have the letters a, and e, and i in the same word?

Building off the last example, you could do something like this, which
keeps only the words that contain none of a, e, or i:

  fmt -1 /path/to/file | grep '^[^aei]*$'

Note that in other examples, I set up a regular character class, as --

  [abcd]

where you match any of the characters in brackets. In this last example, I
instead decided to use a negated character class, as --

  [^abcd]

where you match any characters *except* the bracketed ones.
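
Going back to the question about a, e, and i "in the same word": it could
also be read as "skip only the words that contain all three letters".
Under that reading, and assuming one word per line, one possible sketch
uses awk in place of grep, since a single grep has no easy way to say
"and". The sample words are made up.

```shell
# Hypothetical sample words, one per line.
sample=$(mktemp)
printf '%s\n' radio aerie media rhythm sky > "$sample"

# Keep a word unless it contains an a AND an e AND an i, in any order.
result=$(awk '!(/a/ && /e/ && /i/)' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Here "radio" is kept because it has no e, whereas a pattern like
'^[^aei]*$' would reject it for having an a and an i.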

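So, if you want to exclude single letters, these are roughly equivalent:

  grep '^[^abcd]*$' /path/to/file
  grep -v '[abcd]' /path/to/file

The former looks for lines made up entirely of non-[abcd] characters,
while the latter throws away any line that has an [abcd] character in it.

Note that the ^ and $ anchors matter here: an unanchored '[^abcd]'
matches any line with at least one character outside the class, which is
nearly every line.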
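
A quick way to see how the character-class form and the -v form compare
is to run them against a tiny sample. This sketch assumes a throwaway
file of four made-up words. Note that the character-class version only
agrees with -v when it is anchored with ^ and $, so that the class has to
cover the whole line.

```shell
# Hypothetical sample words, one per line.
sample=$(mktemp)
printf '%s\n' bad dog fun cab > "$sample"

# Anchored negated class: every character on the line is outside [abcd].
anchored=$(grep '^[^abcd]*$' "$sample")

# Inverted match: drop any line containing a character from [abcd].
inverted=$(grep -v '[abcd]' "$sample")

# Unanchored negated class: at least one character is outside [abcd].
unanchored=$(grep '[^abcd]' "$sample")

printf 'anchored: %s\ninverted: %s\nunanchored: %s\n' \
  "$anchored" "$inverted" "$(printf '%s' "$unanchored" | tr '\n' ' ')"
rm -f "$sample"
```

The anchored and inverted forms both leave just "fun". The unanchored
form also lets "dog" through, because its "o" and "g" are outside the
class even though its "d" is not.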

Subtle difference. You may find that one version is more efficient than
the other, and the two may handle edge cases differently. My hunch is that
the [^abcd] variant will usually be faster & easier, but I haven't
actually tested this idea.

The situation where the -v exclusion match excels is when you want to
exclude not just an individual character, but whole words or phrases:

  egrep -v 'this|that|another thing' /path/to/file

There is no trivial equivalent to this in character classes.
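
As a small illustration of excluding whole words and phrases, here is a
sketch run against a made-up sample file, again with grep -E standing in
for egrep.

```shell
# Hypothetical sample lines.
sample=$(mktemp)
printf '%s\n' 'keep this line' 'nothing to see' 'that one goes' \
  'yet another thing' 'thistle patch' 'all clear' > "$sample"

# Drop every line containing "this", "that", or "another thing".
result=$(grep -E -v 'this|that|another thing' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Only "nothing to see" and "all clear" are left. The match is on
substrings, so "thistle patch" is dropped for containing "this"; the -w
flag in GNU and BSD grep restricts matches to whole words if that is not
what you want.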

On the other hand, character classes might let you "inline" reverse
matches in some cases:

  % grep '^j[^l]*y$' months
  january

As opposed to something like

  % grep '^j.*y$' months | grep -v 'l'
  january

Make sense?

Take a look at _Mastering Regular Expressions_ for more details. For such
a seemingly dry subject, it's a fascinating read... :)

-- 
Chris Devers cdevers at pobox.com
http://devers.homeip.net:8080/

Malloc, malloc, n & v. trans.
1 n. Canaanite deity controlling memory allocations.
2 v. trans. C/C++ library. To request space on the heap.

    -- from _The Computer Contradictionary_, Stan Kelly-Bootle, 1995