Reputation: 25
https://m-selig.ae.illinois.edu/ads/coord/ag25.dat
I'm trying to scrape data from the UIUC airfoil database website but all of the links are formatted differently than the other. I tried using pandas read table and use skiprows
to skip the non-data point part of the url but every url have a different number of rows to skip.
How can I manage to only read the numbers in the url?
Upvotes: 0
Views: 284
Reputation: 28640
Use pd.read_fwf()
which will read a table of fixed-width formatted lines into DataFrame:
In terms of how to handle different files with different rows to skip, what we could do is once the file is read, just count the rows until there is a line that contains only numeric values. Then feed that into the skiprows
parameter.
In the case of values greater than 1.0, we can simply just filter those out from the dataframe.
import pandas as pd
from io import StringIO
import requests
url = 'https://m-selig.ae.illinois.edu/ads/coord/ag25.dat'
response = requests.get(url).text
for idx, line in enumerate(response.split('\n'), start=1):
if all([x.replace('.','').isdecimal() for x in line.split()]):
break
skip = idx
df = pd.read_fwf(StringIO(response), skiprows=skip, header=None)
df = df[~(df > 1).any(1)]
Output:
print(df)
0 1
0 1.000000 0.000283
1 0.994054 0.001020
2 0.982050 0.002599
3 0.968503 0.004411
4 0.954662 0.006281
.. ... ...
155 0.954562 0.001387
156 0.968423 0.000836
157 0.982034 0.000226
158 0.994050 -0.000374
159 1.000000 -0.000680
[160 rows x 2 columns]
**Option 2:**
import pandas as pd
import requests
url = 'https://m-selig.ae.illinois.edu/ads/coord/b707b.dat'
response = requests.get(url).text
lines = []
for idx, line in enumerate(response.split('\n'), start=1):
if all([x.replace('.','').replace('-','').isdecimal() for x in line.split()]):
lines.append(line)
lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(1)]
Upvotes: 1