John Chae
John Chae

Reputation: 25

How to scrape data into an excel file

https://m-selig.ae.illinois.edu/ads/coord/ag25.dat

I'm trying to scrape data from the UIUC airfoil database website but all of the links are formatted differently than the other. I tried using pandas read table and use skiprows to skip the non-data point part of the url but every url have a different number of rows to skip. How can I manage to only read the numbers in the url?

Upvotes: 0

Views: 284

Answers (1)

chitown88
chitown88

Reputation: 28640

Use pd.read_fwf() which will read a table of fixed-width formatted lines into DataFrame:

In terms of how to handle different files with different rows to skip, what we could do is once the file is read, just count the rows until there is a line that contains only numeric values. Then feed that into the skiprows parameter.

In the case of values greater than 1.0, we can simply just filter those out from the dataframe.

import pandas as pd
from io import StringIO
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/ag25.dat'
response = requests.get(url).text

for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.','').isdecimal() for x in line.split()]):
        break
    skip = idx    

df = pd.read_fwf(StringIO(response), skiprows=skip, header=None)
df = df[~(df > 1).any(1)]

Output:

    print(df)
                0         1
    0    1.000000  0.000283
    1    0.994054  0.001020
    2    0.982050  0.002599
    3    0.968503  0.004411
    4    0.954662  0.006281
    ..        ...       ...
    155  0.954562  0.001387
    156  0.968423  0.000836
    157  0.982034  0.000226
    158  0.994050 -0.000374
    159  1.000000 -0.000680
    
    [160 rows x 2 columns]

**Option 2:**

import pandas as pd
import requests

url = 'https://m-selig.ae.illinois.edu/ads/coord/b707b.dat'
response = requests.get(url).text

lines = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.','').replace('-','').isdecimal() for x in line.split()]):
        lines.append(line)

lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(1)]

Upvotes: 1

Related Questions