Reputation: 6555

Extracting data from dataset using regex

I have this dataset:

LP3I22- M5
01174c-qbFD.raw
L2P2 + p LPI Full ms [150.00-1500.00]
Scan #: 1
RT: 6.11
m/z Intensity   Relative    Resolution  Charge  Baseline

  150.0119         67.3     0.00    152545.44       0.00       26.27
  150.0153         59.3     0.00    269991.72       0.00       26.28
  150.0156         66.1     0.00    288504.16       0.00       26.28
  150.0161         67.2     0.00    172425.14       0.00       26.28
  150.0330         78.9     0.00    167957.34       0.00       26.32
  150.0485         75.0     0.00    208783.14       0.00       26.35
  150.0603        166.2     0.00    220081.53       0.00       26.37
  150.0624         75.8     0.00    189976.39       0.00       26.38
  150.0866         70.1     0.00    233127.77       0.00       26.42
  150.0991         54.8     0.00    193755.25       0.00       26.45
  150.1136         62.9     0.00    184047.91       0.00       26.48
  150.1348         85.4     0.00    206299.06       0.00       26.52
  150.1410         68.7     0.00    225439.47       0.00       26.53
  150.1428         73.1     0.00    205324.42       0.00       26.54
  150.1498         61.2     0.00    199792.59       0.00       26.55
  150.1572         56.8     0.00    160342.95       0.00       26.57
  150.1583         71.4     0.00    187849.53       0.00       26.57
  150.1746         84.7     0.00    211934.81       0.00       26.60
  150.1777         81.2     0.00    251123.45       0.00       26.61
  150.2106         65.7     0.00    198830.13       0.00       26.67
  150.2144         53.7     0.00    190111.53       0.00       26.68
  150.2781         74.0     0.00    187803.52       0.00       26.81
  150.2807         90.7     0.00    174743.38       0.00       26.82

How can I extract the data rows using a regex? I'm not interested in the first 7 lines.

Upvotes: 1

Answers (4)

bta

Reputation: 45057

lines = IO.readlines('inputfile.txt')
data = lines[7..-1].collect{|x| x.scan(/([^\d]+[\d.]+)/).flatten.map{|y| y.strip}}

For a simpler solution that doesn't involve a regex, replace the last line with:

data = lines[7..-1].collect{|x| x.split}

This all assumes that the data set matches the one you listed and does not contain any unexpected or improperly-formatted values.
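If that assumption is a concern, the split approach can be combined with a validation step. The following is only a minimal sketch: the sample string stands in for the file contents, the "not-a-number" row is invented to show a rejected line, and `Float(..., exception: false)` assumes Ruby 2.6 or later.

```ruby
# Hypothetical sample standing in for the contents of inputfile.txt.
sample = <<~'TEXT'
  LP3I22- M5
  01174c-qbFD.raw
  L2P2 + p LPI Full ms [150.00-1500.00]
  Scan #: 1
  RT: 6.11
  m/z Intensity   Relative    Resolution  Charge  Baseline

    150.0119         67.3     0.00    152545.44       0.00       26.27
    150.0153         59.3     0.00    269991.72       0.00       26.28
    not-a-number     59.3     0.00    269991.72       0.00       26.28
TEXT

# Skip the 7 leading lines, split each row on whitespace, drop blank rows.
rows = sample.lines.drop(7).map(&:split).reject(&:empty?)

# A row is valid when it has 6 fields and each field parses as a float.
# Float(str, exception: false) returns nil for malformed input,
# whereas to_f would silently return 0.0.
valid, invalid = rows.partition do |fields|
  fields.size == 6 && fields.all? { |f| Float(f, exception: false) }
end

data = valid.map { |fields| fields.map { |f| Float(f) } }

p data
p invalid
```

Rows that end up in `invalid` can then be logged or inspected instead of being silently turned into zeros.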

Upvotes: 3

steenslag

Reputation: 80075

7.times { DATA.readline }  # discard the first 7 lines (6 header lines plus the blank line)
# split with no argument splits on runs of whitespace and ignores leading whitespace;
# a bare squeeze would also collapse repeated digits (e.g. "150.0119" -> "150.019")
res = DATA.map { |line| line.split.map { |el| el.to_f } }

__END__
LP3I22- M5
01174c-qbFD.raw
L2P2 + p LPI Full ms [150.00-1500.00]
Scan #: 1
RT: 6.11
m/z Intensity   Relative    Resolution  Charge  Baseline

  150.0119         67.3     0.00    152545.44       0.00       26.27
  150.0153         59.3     0.00    269991.72       0.00       26.28
  150.0156         66.1     0.00    288504.16       0.00       26.28
  150.0161         67.2     0.00    172425.14       0.00       26.28
  150.0330         78.9     0.00    167957.34       0.00       26.32
  150.0485         75.0     0.00    208783.14       0.00       26.35
  150.0603        166.2     0.00    220081.53       0.00       26.37

The values in res are now floats:

 [[150.0119, 67.3, 0.0, 152545.44, 0.0, 26.27], [150.0153, 59.3, 0.0, 269991.72, 0.0, 26.28],
 [150.0156, 66.1, 0.0, 288504.16, 0.0, 26.28], [150.0161, 67.2, 0.0, 172425.14, 0.0, 26.28],
 [150.033, 78.9, 0.0, 167957.34, 0.0, 26.32], [150.0485, 75.0, 0.0, 208783.14, 0.0, 26.35],
 [150.0603, 166.2, 0.0, 220081.53, 0.0, 26.37]]

Upvotes: 1

John Douthat

Reputation: 41189

Assuming the whole file contents are in a String called data:

number_re = /\s*(\d+\.\d+)\s*/
data.scan(/^#{number_re.source * 6}$/)

That will result in the following array of string arrays, one per data row:

[["150.0119", "67.3", "0.00", "152545.44", "0.00", "26.27"],
 ["150.0153", "59.3", "0.00", "269991.72", "0.00", "26.28"],
 ["150.0156", "66.1", "0.00", "288504.16", "0.00", "26.28"],
 ["150.0161", "67.2", "0.00", "172425.14", "0.00", "26.28"],
 ["150.0330", "78.9", "0.00", "167957.34", "0.00", "26.32"],
 ["150.0485", "75.0", "0.00", "208783.14", "0.00", "26.35"],
 ["150.0603", "166.2", "0.00", "220081.53", "0.00", "26.37"],
 ["150.0624", "75.8", "0.00", "189976.39", "0.00", "26.38"],
 ["150.0866", "70.1", "0.00", "233127.77", "0.00", "26.42"],
 ["150.0991", "54.8", "0.00", "193755.25", "0.00", "26.45"],
 ["150.1136", "62.9", "0.00", "184047.91", "0.00", "26.48"],
 ["150.1348", "85.4", "0.00", "206299.06", "0.00", "26.52"],
 ["150.1410", "68.7", "0.00", "225439.47", "0.00", "26.53"],
 ["150.1428", "73.1", "0.00", "205324.42", "0.00", "26.54"],
 ["150.1498", "61.2", "0.00", "199792.59", "0.00", "26.55"],
 ["150.1572", "56.8", "0.00", "160342.95", "0.00", "26.57"],
 ["150.1583", "71.4", "0.00", "187849.53", "0.00", "26.57"],
 ["150.1746", "84.7", "0.00", "211934.81", "0.00", "26.60"],
 ["150.1777", "81.2", "0.00", "251123.45", "0.00", "26.61"],
 ["150.2106", "65.7", "0.00", "198830.13", "0.00", "26.67"],
 ["150.2144", "53.7", "0.00", "190111.53", "0.00", "26.68"],
 ["150.2781", "74.0", "0.00", "187803.52", "0.00", "26.81"],
 ["150.2807", "90.7", "0.00", "174743.38", "0.00", "26.82"]]
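The captures are strings. If numeric values are needed, one possible follow-up is to map each field through to_f. This is just a sketch on a shortened, hypothetical copy of the input; the names row_re and rows are mine.

```ruby
number_re = /\s*(\d+\.\d+)\s*/
# Six numbers on one line, anchored at line start and line end.
row_re = /^#{number_re.source * 6}$/

# Shortened sample standing in for the full data String.
data = <<~'TEXT'
  RT: 6.11
  m/z Intensity   Relative    Resolution  Charge  Baseline

    150.0119         67.3     0.00    152545.44       0.00       26.27
    150.0153         59.3     0.00    269991.72       0.00       26.28
TEXT

# scan returns one array of six captures per matching line;
# header lines do not match and are skipped automatically.
rows = data.scan(row_re).map { |fields| fields.map(&:to_f) }

p rows
```

Because the pattern only matches lines made up of six decimal numbers, there is no need to skip the header lines explicitly.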

Upvotes: 6

Highly Irregular

Reputation: 40609

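Use this pattern:

^\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*$

in multiline mode, so that ^ and $ match at the start and end of each line. In Ruby, ^ and $ always match at line boundaries, so no extra flag is needed there.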
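As a rough illustration in Ruby, the pattern can be applied one line at a time. The constant ROW and the shortened sample text are assumptions made for this sketch only.

```ruby
ROW = /^\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*(\d+\.\d+)\s*$/

# Shortened sample standing in for the real file contents.
text = <<~'TEXT'
  Scan #: 1
  RT: 6.11

    150.0119         67.3     0.00    152545.44       0.00       26.27
    150.0153         59.3     0.00    269991.72       0.00       26.28
TEXT

# match returns nil for header and blank lines; compact drops those.
rows = text.each_line.map { |line| (m = ROW.match(line)) && m.captures }.compact

p rows
```

Each element of rows is an array of six strings, one per capture group.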

Upvotes: 1
