Mario

Reputation: 11

Load firewall logs fields to Hive table

I have log files from a firewall, but some of the log lines do not have all the columns, or the columns appear in a different order.

Example of log:

time=2013-08-07 15:00:38|log_component=Firewall Rule|status=Allow|application=Skype Services|src_ip=172.16.16.79|dst_ip=192.168.2.4
time=2013-08-07 15:00:39|log_component=Firewall Rule|status=Allow|src_ip=172.16.16.80
time=2013-08-07 15:00:40|status=Allow|src_ip=172.16.16.81|dst_ip=192.168.2.6
time=2013-08-07 15:00:41|log_component=Firewall Rule|status=Allow|application=Gmail Services|src_ip=172.16.16.82|dst_ip=192.168.2.7

I have used this script to load log fields to Hive table:

DROP TABLE IF EXISTS firewall_logs;
CREATE TABLE firewall_logs(
time STRING,
log_component STRING,
status STRING,
application STRING,
src_ip STRING,
dst_ip STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "time=(.*?)\\|log_component=(.*?)\\|status=(.*?)\\|application=(.*?)\\|src_ip=(.*?)\\|dst_ip=(.*?)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/hadoop/firewall.log" INTO TABLE firewall_logs;
SELECT * FROM firewall_logs;

This is the result of query:

+---------------------+---------------+-------+----------------+--------------+-------------+
| 2013-08-07 15:00:38 | Firewall Rule | Allow | Skype Services | 172.16.16.79 | 192.168.2.4 |
| NULL                | NULL          | NULL  | NULL           | NULL         | NULL        |
| NULL                | NULL          | NULL  | NULL           | NULL         | NULL        |
| 2013-08-07 15:00:41 | Firewall Rule | Allow | Gmail Services | 172.16.16.82 | 192.168.2.7 |
+---------------------+---------------+-------+----------------+--------------+-------------+

However, I need the log fields to be loaded into the Hive table in this format:

+---------------------+---------------+-------+----------------+--------------+-------------+
| 2013-08-07 15:00:38 | Firewall Rule | Allow | Skype Services | 172.16.16.79 | 192.168.2.4 |
| 2013-08-07 15:00:39 | Firewall Rule | Allow | *NULL*         | 172.16.16.80 | *NULL*      |
| 2013-08-07 15:00:40 | *NULL*        | Allow | *NULL*         | 172.16.16.81 | 192.168.2.6 |
| 2013-08-07 15:00:41 | Firewall Rule | Allow | Gmail Services | 172.16.16.82 | 192.168.2.7 |
+---------------------+---------------+-------+----------------+--------------+-------------+

However, Hive's RegexSerDe has this limitation (according to the documentation on GitHub):

"RegexSerDe uses regular expression (regex) to deserialize data. It does not support data serialization. It can deserialize the data using regex and extracts groups as columns. In deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has less than expected groups, the missing groups will be NULL. If a row matches the regex but has more than expected groups, the additional groups are just ignored."

How can I make fields that are not present in a log line load into the table with a value of NULL?

Is there any other Hadoop ecosystem tool that will let me format the logs (or map log fields to the table fields) before loading them into the Hive table?

Upvotes: 1

Views: 172

Answers (2)

Ankita

Reputation: 28

  1. CREATE TABLE temp (line STRING);
  2. LOAD DATA LOCAL INPATH '/home/hadoop/firewall.log' INTO TABLE temp;
  3. INSERT INTO main_table SELECT regexp_extract(line, 'time=([^|]*)', 1) AS time FROM temp;

I have written the regexp_extract for one column; you have to write a similar one for each column:

SELECT regexp_extract(...) AS col1, regexp_extract(...) AS col2, ... FROM temp;

I hope this helps.

Upvotes: 0

OneCricketeer

Reputation: 191874

Your regex requires every key=value pair (log_component=, application=, dst_ip=, and so on) to be present, in exactly that order, so any line that is missing a field does not match at all, and every column in that row becomes NULL.

Is there any other Hadoop ecosystem tool that will allow me to format (or map logs fields according to the table fields) the logs before load them to Hive table?

Hive will work, but your SerDe needs to be tested better. Pig and Spark will have the same problem if you insist on using Regex.

You do not need regex, though. Use the pipes as your delimiter:

ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

Prior to using that, though, you need to clean up your data so it is consistently delimited. For example, an empty field still needs both of its delimiters, i.e. ||, so the positions line up.
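As a sketch of that cleanup step (this script is my illustration, not part of the original answer; the field names and order come from the question's table definition), a small Python preprocessor can rewrite each line with all six fields in a fixed order, emitting empty fields so the pipe positions always match the columns:

```python
# Rewrite firewall log lines into a fixed six-field, pipe-delimited layout.
# Missing fields become empty strings, so ROW FORMAT DELIMITED can map
# every position to the right Hive column.
FIELDS = ["time", "log_component", "status", "application", "src_ip", "dst_ip"]

def normalize(line):
    # Each record is a series of key=value pairs separated by '|'.
    pairs = dict(part.split("=", 1) for part in line.strip().split("|") if "=" in part)
    return "|".join(pairs.get(field, "") for field in FIELDS)
```

Run this over firewall.log before the LOAD DATA step. Note that with a plain delimited table the missing fields arrive as empty strings rather than NULL, unless you also set 'serialization.null.format'='' on the table.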

You can use regex to capture missing fields, for example,

(?:application=(.*?)\\|)?

but that still expects the columns to appear in a strict order.
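To see how the optional-group variant behaves, here is a quick sketch in plain Python (so the pattern uses single backslashes rather than Hive's doubled escaping); the full pattern is my extrapolation from the one optional group shown above:

```python
import re

# Every field after time= sits in an optional non-capturing group, so a line
# missing e.g. application= still matches and yields None for that column --
# but the keys must still appear in exactly this order.
PATTERN = re.compile(
    r"time=(.*?)"
    r"(?:\|log_component=([^|]*))?"
    r"(?:\|status=([^|]*))?"
    r"(?:\|application=([^|]*))?"
    r"(?:\|src_ip=([^|]*))?"
    r"(?:\|dst_ip=([^|]*))?$"
)

m = PATTERN.match("time=2013-08-07 15:00:39|log_component=Firewall Rule"
                  "|status=Allow|src_ip=172.16.16.80")
# m.groups() holds one entry per column, with None for the missing fields.
```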

Personally, Spark would be my choice: at least split each line at the pipe, then conditionally parse each column into a class object, from which you can define a Dataset and write it into Hive.
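A sketch of that Spark approach (my own illustration, untested against a live cluster; parse_line is a hypothetical helper, and the column names are taken from the question's schema): split on the pipe, build a dict of key=value pairs, and emit None for absent columns so they land in Hive as NULL:

```python
# Parse one log line into a tuple matching the firewall_logs schema.
FIELDS = ["time", "log_component", "status", "application", "src_ip", "dst_ip"]

def parse_line(line):
    # Conditional parse: only the keys actually present contribute values.
    kv = dict(part.split("=", 1) for part in line.strip().split("|") if "=" in part)
    return tuple(kv.get(field) for field in FIELDS)  # None becomes NULL in Hive

# With a Hive-enabled SparkSession the write itself could then look like:
# df = spark.read.text("/home/hadoop/firewall.log") \
#           .rdd.map(lambda row: parse_line(row.value)).toDF(FIELDS)
# df.write.mode("overwrite").saveAsTable("firewall_logs")
```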

Upvotes: 1
