Reputation: 4691
We are running Spark ingestion jobs that process multiple files in batches. We read CSV or TSV files in batches, create a DataFrame, and apply some transformations before loading the result into a BigQuery table.
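For reference, a minimal sketch of what such a job does, assuming Scala Spark and the spark-bigquery connector; the bucket, column, and table names are placeholders, not our actual job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// A sketch of the described pipeline; paths and names are placeholders.
val spark = SparkSession.builder().appName("batch-ingestion").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("sep", "\t")                  // "," for the CSV batches
  .csv("gs://my-bucket/input/batch-*/")

val transformed = df.filter(col("id").isNotNull)  // stand-in transformation

transformed.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-temp-bucket")  // staging bucket for the indirect write
  .mode("append")
  .save("my_dataset.my_table")
```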
The jobs complete successfully with no issues, but I still see some INFO messages like:
25/02/28 03:13:53 INFO Configuration: resource-types.xml not found
25/02/28 03:13:53 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/02/28 03:13:54 INFO YarnClientImpl: Submitted application application_1740026198297_0223
25/02/28 03:13:55 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at dnb-uat-pipe-pm-grp-a-transform-v1-m.c.bmas-eu-pm-dnb-uat-pipe.internal./10.31.167.88:8030
25/02/28 03:13:56 INFO GhfsGlobalStorageStatistics: periodic connector metrics: {gcs_api_client_non_found_response_count=1, gcs_api_client_side_error_count=1, gcs_api_time=97, gcs_api_total_request_count=2, gcs_connector_time=265, gcs_list_file_request=1, gcs_list_file_request_max=49, gcs_list_file_request_mean=49, gcs_list_file_request_min=49, gcs_metadata_request=1, gcs_metadata_request_max=48, gcs_metadata_request_mean=48, gcs_metadata_request_min=48, gs_filesystem_create=3, gs_filesystem_initialize=2, op_get_file_status=1, op_get_file_status_max=265, op_get_file_status_mean=265, op_get_file_status_min=265, uptimeSeconds=8} [CONTEXT ratelimit_period="5 MINUTES" ]
25/02/28 03:13:56 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/02/28 03:13:57 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-europe-north1-128913892554-flfmc7ba/ed61bdc8-e009-4f91-8be7-1abc665d500c/spark-job-history/application_1740026198297_0223.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
25/02/28 03:14:30 INFO FilesystemCsvReader: Total header groups: 1, non-empty header groups sizes [390]
25/02/28 03:14:56 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/02/28 03:14:57 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://dataproc-temp-europe-north1-128913892554-flfmc7ba/ed61bdc8-e009-4f91-8be7-1abc665d500c/spark-job-history/application_1740026198297_0223.inprogress [CONTEXT ratelimit_period="1 MINUTES [skipped: 33]" ]
Do we really need to fix these informational messages?
GoogleHadoopOutputStream: hflush(): No-op due to rate limit
resource-types.xml not found
Are they causing any performance issues, or can we just leave them?
Upvotes: 1
Views: 20
Reputation: 2768
The method hflush
exists to guarantee that written data is visible to subsequent readers of the object after the call returns:
Syncable.hflush()
Flush out the data in client's user buffer. After the return of this call, new readers will see the data. The hflush() operation does not contain any guarantees as to the durability of the data, only its visibility.
So this log message just means that there is no such guarantee here, because the object in GCS is updated too often, exceeding the rate limit. The write operation itself succeeds, but as the message says:
...readers will not yet see flushed data
This means that readers may not see these changes right away, though they will see them eventually.
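To illustrate where this call comes from, here is a minimal sketch of an explicit hflush() through the Hadoop FileSystem API; the bucket, path, and payload are placeholders. With the GCS connector, this is the call that may be rate-limited into the no-op reported by the INFO line:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// A sketch, not a definitive example: write to GCS via the Hadoop API,
// then request visibility with hflush().
val fs  = FileSystem.get(new URI("gs://my-bucket/"), new Configuration())
val out = fs.create(new Path("gs://my-bucket/tmp/example.log"))
out.writeBytes("event 1\n")
// hflush() promises visibility (not durability) to new readers; the GCS
// connector may turn it into a no-op when the object is flushed too often.
out.hflush()
out.close()
fs.close()
```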
Are they causing any performance issues or we can just leave it?
It is not about performance but about consistency, so it depends on how the written data is consumed downstream. For example, if another automated process reads this data immediately after it is written, some of it may be missed.
Update: In your particular case the log message is about spark-job-history/application_1740026198297_0223.inprogress. These are not data files, but event logs:
When running long running streaming applications, the HDFS storage gets filled up with large *.inprogress files in the hdfs://spark-history/ directory
The relevant Spark configuration options are those with the spark.eventLog.* prefix.
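For example, a sketch of how those options could be tuned at session creation (the option names are real Spark settings; the GCS path is a placeholder). Disabling event logging entirely should also make the hflush messages for the .inprogress file go away, at the cost of losing the job history:

```scala
import org.apache.spark.sql.SparkSession

// Event-log settings must be in place before the SparkContext starts,
// so they are passed when the session is created.
val spark = SparkSession.builder()
  .appName("batch-ingestion")
  .config("spark.eventLog.enabled", "true")            // "false" drops the .inprogress files entirely
  .config("spark.eventLog.dir", "gs://my-bucket/spark-events")
  .config("spark.eventLog.rolling.enabled", "true")    // roll large event logs instead of one big file
  .config("spark.eventLog.rolling.maxFileSize", "128m")
  .getOrCreate()
```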
Upvotes: 1