r1ty
r1ty

Reputation: 45

Hive select rows where string in column A contains string from column B

I have a dataset that contains some rows with invalid entries in one column.

I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.

Pseudo code might look something like this:

SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)

I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.

The python code would look like this:

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

Which would result in something like this:

df

a                         b        c
/bobs/burgers/1234        bobs     idx 
/bobs/burgers/2234        fred     idx
/cindys/sandwhiches/3234  cindy    idx

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

a                         b        c
/bobs/burgers/1234        bobs     idx
/cindys/sandwhiches/3234  cindy    idx

Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?

Upvotes: 2

Views: 7218

Answers (2)

leftjoin
leftjoin

Reputation: 38290

In addition to like and rlike, you can use these functions:

Using instr:

select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;

Using locate:

select distinct b,c from some.log where date = current_date and locate(b, a) > 0;

See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Upvotes: 6

nobody
nobody

Reputation: 11080

LIKE or RLIKE

select distinct b,c from some.log where date = 'today' and a like concat('%',b,'%');

Or

select distinct b,c from some.log where date = 'today' and a rlike concat('*',b,'*');

Upvotes: 2

Related Questions