Conor

Reputation: 43

Remove text up to Nth instance of pattern match in csv files

I'm looking for a way to remove the first n lines from csv files.

Basically I've been given a dump of several hundred csv files with the task of creating a queryable MySQL database. Each file starts with a legend in non-csv format that takes up the first ~10 lines and causes an error when I try to import the file into MySQL. The legend varies in length because not all files have the same number of parameters.

I'm looking for a way to remove the legend and the only pattern I can find is that the first csv element is always the second instance of the word year.

The files look something like this. I want each file to start at the second instance of lower-case year.

Legend:
non-csv text...
year: Year
... etc

(csv format) year, month, day, etc...

I've looked at sed commands to loop through each file but can't find one that achieves exactly what I want. i.e:

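find . -name "*.csv" |
while IFS= read -r filename;
do
  # quote the name and write one output per input instead of reusing newFile.csv
  sed -n '/year/,$p' "$filename" > "${filename%.csv}.trimmed";
done;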

This removes all text before the first instance of year but I'm unfamiliar with sed and can't figure out how to make it skip to the second instance. I tried the above in a recursive function but it didn't work.

Any suggestions?

Upvotes: 2

Views: 67

Answers (2)

potong

Reputation: 58558

This might work for you (GNU sed):

sed ':a;N;s/year/&/2;Ta;s/.*\n//' file

This gathers up lines until the second appearance of year and then deletes all lines up to but not including the current line.
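
As a rough illustration, the command can be wrapped in a small helper and tried on a sample file. The function name strip_legend_sed and the sample contents are invented for this sketch. It assumes GNU sed (T is a GNU extension) and assumes the data rows after the header never contain year, since a later pair of matches would trigger the final substitution again and drop rows.

```shell
#!/bin/sh
# Hypothetical helper: print a file starting at the line that holds
# the second occurrence of "year". Assumes GNU sed (T is a GNU extension).
strip_legend_sed() {
  sed ':a;N;s/year/&/2;Ta;s/.*\n//' "$1"
}

# Build a small sample file (contents invented for illustration).
sample=$(mktemp)
printf '%s\n' 'Legend:' 'non-csv text' 'year: Year' 'month: Month' '' \
  'year,month,day' '2001,1,15' '2001,2,20' > "$sample"

# Should print the header line and the two data rows, without the legend.
strip_legend_sed "$sample"
rm -f "$sample"
```

Note that Year with a capital Y does not count as a match, so the legend line year: Year contributes only one occurrence.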

Upvotes: 1

karakfa

Reputation: 67557

awk to the rescue!

$ awk '/year/{c++} c>1' file

(csv format) year, month, day, etc...
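
To apply this to the whole dump rather than a single file, something along these lines might work. It is only a sketch: the function name strip_legends and the directory arguments are hypothetical, the output directory is assumed to sit outside the source directory, and cleaned files are written under their base name, so two inputs with the same name in different subdirectories would overwrite each other.

```shell
#!/bin/sh
# Hypothetical helper: strip the legend from every *.csv under src_dir and
# write the cleaned copies into out_dir, named after each file's base name.
# Assumes out_dir is not inside src_dir and file names contain no newlines.
strip_legends() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  find "$src_dir" -type f -name '*.csv' | while IFS= read -r f; do
    # Count lines containing "year"; print from the second such line onward.
    awk '/year/{c++} c>1' "$f" > "$out_dir/$(basename "$f")"
  done
}
```

A call might look like strip_legends ./dump ./cleaned, where both directory names are placeholders. The originals are left untouched, so the result can be checked before importing into MySQL.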

Upvotes: 3
