Shan

Reputation: 11

Using grep to count how many books were published in a certain time period

I'm currently learning grep as well as regex and other shell tools. The input text file is a library catalog that lists authors, their books, and editions:

ISBN Title Edition Author Year
8298 foo 3 Charles 1999

I need to use grep to find the total number of books that were published within a given time period (for example, 1993-2008).

I have tried:

grep -E ', (19(7[5-9]|[89][0-9])|200[0-5])$'

This didn't produce any result.

I also tried a regex that I came up with, which does produce the right number, but the regex isn't correct.

\s(197[5-9]|198[0-9]|199[0-9]|200[0-5])

Sorry for the vagueness in the question. Also, I'm running on WSL if that makes a difference.

Upvotes: -1

Views: 95

Answers (4)

tripleee

Reputation: 189789

In the general case, to grep a specific field in comma-separated input, specify how many comma-separated fields to skip before the match.

grep -E '^([^,]*,){4}[[:space:]]*(19(7[5-9]|[89][0-9])|200[0-5])$' file.csv

The expression [^,]*, matches one field and the comma after it, i.e. zero or more characters which are not a comma, followed by one which is. Anchoring to the beginning of the line with ^ and repeating this expression four times skips four fields, so the rest of the pattern has to match at the start of the fifth.

Some, but not all, grep implementations allow you to generalize the final anchor to (,|$) i.e. to look for either another comma (for lines with more than five fields) or end of line (for lines with exactly five).

In real life, CSV files can contain quoted fields which embed a literal comma, so then you need a more complex regular expression. Real-life CSV files can also contain quoted fields which span multiple lines, so then grep (or nontrivial Awk) alone will not cut it.

(Also, real CSV files don't have spaces after the commas.)
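As a rough sketch of the field-skipping idea, the snippet below runs the pattern with the (,|$) ending against a few made-up records. The sample rows, including the sixth "hardcover" column and the book titled 1984, are assumptions for illustration and are not from the question's file.

```shell
# Hypothetical sample data; only the "foo" row comes from the question.
sample='ISBN, Title, Edition, Author, Year
8298, foo, 3, Charles, 1999
1234, 1984, 1, Orwell, 1949
5678, baz, 2, Eve, 2005, hardcover
9012, qux, 1, Frank, 2006'

# Skip exactly four comma-terminated fields, allow optional whitespace,
# then require a year from 1975 to 2005 followed by a comma or end of line.
matched=$(printf '%s\n' "$sample" |
  grep -E '^([^,]*,){4}[[:space:]]*(19(7[5-9]|[89][0-9])|200[0-5])(,|$)')

printf '%s\n' "$matched"
```

The row whose title is 1984 is not selected, because the pattern only looks at the fifth field, while the six-field row is selected thanks to the (,|$) ending.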

Upvotes: 0

jhnc

Reputation: 16819

Since the year appears at the end of the line, and assuming each record is a single line:

grep -E ', (19(7[5-9]|[89][0-9])|200[0-5])$'
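To get a count rather than the matching lines, the -c option can be added. This is a minimal sketch that assumes every record ends in a comma, a single space, and the year; the sample rows other than "foo" are invented for illustration.

```shell
# Hypothetical sample data in the assumed "comma, space" layout.
sample='ISBN, Title, Edition, Author, Year
8298, foo, 3, Charles, 1999
1234, bar, 1, Dana, 1974
5678, baz, 2, Eve, 2005
9012, qux, 1, Frank, 2006'

# -c makes grep print the number of matching lines instead of the lines.
count=$(printf '%s\n' "$sample" |
  grep -cE ', (19(7[5-9]|[89][0-9])|200[0-5])$')

echo "$count"
```

With a real file, the same command would take the file name as its last argument instead of reading from the pipe.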

Upvotes: 0

phuclv

Reputation: 41962

You can use this:

awk -F',' '1975 <= $5 && $5 <= 2005' books.txt

-F sets the field separator, and $5 is the fifth field. A condition with no action prints every line for which it is true.
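The command above prints the matching lines. A possible variation that prints only the count is sketched below; the sample rows, the NR > 1 header skip, and the $5+0 numeric coercion are additions for illustration and assume the first line is a header.

```shell
# Hypothetical sample data; only the "foo" row comes from the question.
sample='ISBN, Title, Edition, Author, Year
8298, foo, 3, Charles, 1999
1234, bar, 1, Dana, 1974
5678, baz, 2, Eve, 2005
9012, qux, 1, Frank, 2006'

# NR > 1 skips the header line; $5+0 forces a numeric comparison even when
# the field has a leading space. n+0 prints 0 rather than an empty string
# when nothing matches.
count=$(printf '%s\n' "$sample" |
  awk -F',' 'NR > 1 && $5+0 >= 1975 && $5+0 <= 2005 { n++ } END { print n+0 }')

echo "$count"
```

Unlike the regex approaches, changing the range here only means editing the two numbers.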

Upvotes: 1

Shan

Reputation: 11

I think I've figured it out though.

\s(197[5-9]|198[0-9]|199[0-9]|200[0-5])

If anyone has a better solution, do let me know. Thanks.

Upvotes: -2
