Reputation: 45
I have a dataset that contains some rows with invalid entries in one column.
I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.
Pseudo code might look something like this:
SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)
I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.
The python code would look like this:
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
Which would result in something like this:
df
a b c
/bobs/burgers/1234 bobs idx
/bobs/burgers/2234 fred idx
/cindys/sandwhiches/3234 cindy idx
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
a b c
/bobs/burgers/1234 bobs idx
/cindys/sandwhiches/3234 cindy idx
Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?
Upvotes: 2
Views: 7218
Reputation: 38290
In addition to like
and rlike
, you can use these functions:
Using instr
:
select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;
Using locate
:
select distinct b,c from some.log where date = current_date and locate(b, a) > 0;
See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Upvotes: 6
Reputation: 11080
LIKE
or RLIKE
select distinct b,c from some.log where date = 'today' and a like concat('%',b,'%');
Or
select distinct b,c from some.log where date = 'today' and a rlike concat('*',b,'*');
Upvotes: 2