Digital Farmer
Digital Farmer

Reputation: 2137

Filter IMPORTHTML data

When I import data, it comes in this format (image 1), with blank spaces. I would like to know if there is any way to adjust so that these blanks disappear, the two models expected (image 2 and 3) if there was any way to reach them would be important to me.

Remembering that all dates have / and all times have :

I tried to filter from QUERY, but when trying to "Select Col1, Col2, Col4 Where Col2 is not null" the dates disappear and only the times remain, I tried via REGEXMATCH to separate the dates from the times using / and : but also I was not successful.

I also tried it via IMPORTXML, but some data ends up not being imported correctly on some pages of the site, for IMPORTHTML these errors do not happen. The XML's I used were:

"//tr[@class='no-date-repetition-new' and ..//td[@class='team team-a']] | //tr[@class='no-date-repetition-new live-now' and ..//td[@class='team team-a']]"
"//td[@class='team team-a']/a | //td[@class='team team-a strong']/a"

The current formula is as follows:
=IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1)

IMPORTHTML Original: enter image description here

Expected formats:
enter image description here --- enter image description here

Upvotes: 2

Views: 1722

Answers (2)

marikamitsos
marikamitsos

Reputation: 10573

You can join 2 queries together (one next to the other) in a single formula, to get your results

={QUERY(IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1), 
    "select Col1 where Col2 is null and not Col1 contains '*'",1), 
QUERY(IMPORTHTML("https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/","table",1), 
    "select Col1, Col2, Col3, Col4 where Col2 is not null label Col1 'Time'",1)}

How the formula works:

  1. As you notice the data part of both queries is the same in both of them. What is actually different is "what we ask for from the query"
  2. In the first one we use "select Col1 where Col2 is null and not Col1 contains '*'"
  3. In the second one "select Col1, Col2, Col3, Col4 where Col2 is not null label Col1 'Time'"
  4. We create an array by joining them together as in ={1stQUERY,2ndQUERY}

enter image description here

Upvotes: 2

Wicket
Wicket

Reputation: 38425

Rather than filtering what you need is to restructure the imported data.

Anyway, I think that the easier solution to get the final result is to use multiple IMPORTXML formulas.

URL
A1: https://int.soccerway.com/national/austria/1-liga/20192020/regular-season/r54328/

Headers
A2: //table[contains(@class,'matches')]/thead/tr/th

Day
A3: //td[contains(@class,'date')]/parent::tr

Teams and Score
A4: //td[contains(@class,'team-a')]/parent::tr

A6: =transpose(IMPORTXML($A$1,A2))
A7: =IMPORTXML($A$1,A3)
B7: =IMPORTXML(A1,A4)

You might want to replace the formula on A6 by static values in order to place them properly.

Upvotes: 2

Related Questions