HADOOP-19348. Integrate analytics accelerator into S3A. #7334

Draft
wants to merge 4 commits into
base: trunk

Conversation

ahmarsuhail
Contributor

Description of PR

Initial integration of analytics accelerator.

How was this patch tested?

In progress

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@ahmarsuhail ahmarsuhail marked this pull request as draft January 28, 2025 13:29
@ahmarsuhail ahmarsuhail changed the title from "/HADOOP-19348. Integrate analytics accelerator into S3A." to "HADOOP-19348. Integrate analytics accelerator into S3A." Jan 28, 2025
@ahmarsuhail ahmarsuhail force-pushed the feature-HADOOP-19363-analytics-accelerator-s3 branch 2 times, most recently from e18d0a4 to d45beae Compare January 31, 2025 14:57
@ahmarsuhail
Contributor Author

Few things to discuss here:

  • Now that we're using S3A's async client, which already has the execution interceptors attached, a lot of tests fail because out-of-span operations get rejected. Since we don't support auditing right now, can we recommend that anyone running with AAL turned on also sets fs.s3a.audit.reject.out.of.span.operations to false?

  • The async client from the current SDK version doesn't do ranged GETs if multipartEnabled is set on it. To get ranged GETs, we either upgrade the SDK or temporarily disable multipartEnabled when AAL is enabled, similar to

private static final String LOGICAL_IO_PREFIX = "logicalio";

@Test
public void testConnectorFrameWorkIntegration() throws IOException {
Contributor Author

small parquet file, src/test/parquet

can we read the file ~10sKB
does it just complete and not complete

malformed footer
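
For the malformed-footer case, a test fixture could be as simple as a byte array whose tail is deliberately wrong. The sketch below relies only on the Parquet file framing (leading magic "PAR1", footer metadata, a 4-byte little-endian footer length, trailing magic "PAR1"). The class and method names are hypothetical and belong to neither S3A nor AAL, and the check covers framing only; it does not parse the Thrift footer metadata.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical test helper; not part of S3A or the analytics accelerator library. */
final class ParquetTailCheck {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  private ParquetTailCheck() {
  }

  /**
   * Checks the framing only: leading magic, trailing magic, and a declared
   * footer length that fits between the leading magic and the 8-byte tail.
   */
  static boolean hasPlausibleFooter(byte[] file) {
    int n = file.length;
    // Smallest layout: leading magic (4) + footer length (4) + trailing magic (4).
    if (n < 12) {
      return false;
    }
    if (!Arrays.equals(Arrays.copyOfRange(file, 0, 4), MAGIC)) {
      return false;
    }
    if (!Arrays.equals(Arrays.copyOfRange(file, n - 4, n), MAGIC)) {
      return false;
    }
    // The footer length is stored little-endian just before the trailing magic.
    int footerLength = ByteBuffer.wrap(file, n - 8, 4)
        .order(ByteOrder.LITTLE_ENDIAN)
        .getInt();
    return footerLength >= 0 && footerLength <= n - 12;
  }

  /** Builds a tiny fake file with the given footer size and declared length. */
  static byte[] sample(int footerBytes, int declaredLength) {
    ByteBuffer buf = ByteBuffer.allocate(4 + footerBytes + 8)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.put(MAGIC);
    buf.put(new byte[footerBytes]); // zero-filled stand-in for footer metadata
    buf.putInt(declaredLength);
    buf.put(MAGIC);
    return buf.array();
  }
}
```

Calling `ParquetTailCheck.sample(16, 1000)` yields a file whose declared footer length exceeds the file size, which is one way to produce the malformed input for a read test.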

Contributor

@mukund-thakur mukund-thakur left a comment

some old comments about javadoc

public void testOverwriteExistingFile() throws Throwable {
// Will remove this when Analytics Accelerator supports overwrites
skipIfAnalyticsAcceleratorEnabled(this.createConfiguration(),
"Analytics Accelerator does not support overwrites yet");
Contributor

Analytics Accelerator is about read optimizations, right? How does this relate to overwrites?
Is it because the file will be changed? Do you mean it doesn't support the RemoteFileChangedException?

@@ -65,6 +66,8 @@ protected Configuration createConfiguration() {
*/
@Test
public void testNotFoundFirstRead() throws Exception {
skipIfAnalyticsAcceleratorEnabled(getConfiguration(),
"Temporarily disabling to fix Exception handling on Analytics Accelerator");
Contributor

needs to be enabled.

@ahmarsuhail ahmarsuhail force-pushed the feature-HADOOP-19363-analytics-accelerator-s3 branch 2 times, most recently from e18d0a4 to 0d1f291 Compare February 7, 2025 14:57
@apache apache deleted 6 comments from hadoop-yetus Feb 7, 2025
@apache apache deleted 4 comments from hadoop-yetus Feb 10, 2025
@ahmarsuhail ahmarsuhail force-pushed the feature-HADOOP-19363-analytics-accelerator-s3 branch from a3c7498 to f408ec5 Compare February 11, 2025 16:23
@apache apache deleted 2 comments from hadoop-yetus Feb 11, 2025
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | patch | 0m 20s | | #7334 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. |

| Subsystem | Report/Notes |
|-----------|--------------|
| GITHUB PR | #7334 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7334/13/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

import org.apache.hadoop.fs.s3a.VectoredIOContext;

/**
* Requirements for requirements for streams from this factory,
Contributor

java doc correction.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 0m 0s | | Docker mode activated. |
| -1 ❌ | patch | 0m 20s | | #7334 does not apply to trunk. Rebase required? Wrong Branch? See https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute for help. |

| Subsystem | Report/Notes |
|-----------|--------------|
| GITHUB PR | #7334 |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7334/14/console |
| versions | git=2.34.1 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@ahmarsuhail
Contributor Author

Commit 99fbdeb means this will no longer build as is: the AAL change that adds a new constructor for passing in file information (awslabs/analytics-accelerator-s3#223) must be merged and released first (WIP!)

To test this in the meantime, check out commit 038a692.

import java.io.EOFException;
import java.io.IOException;

import org.apache.hadoop.fs.FSExceptionMessages;
Contributor

nit, imports are out of order


package org.apache.hadoop.fs.s3a.impl.streams;

import org.apache.hadoop.conf.Configuration;
Contributor

usual nit: import ordering, and I'd prefer an explicit import of those Constants which are being used

@Override
public void bind(final FactoryBindingParameters factoryBindingParameters) throws IOException {
super.bind(factoryBindingParameters);
this.s3SeekableInputStreamFactory = new LazyAutoCloseableReference<>(createS3SeekableInputStreamFactory());
Contributor

can you chop this line down? It's too wide for side-by-side reviews
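
The line under review wraps the factory in a lazy, closeable reference, presumably so that the underlying object is only created on first use and is released when the enclosing factory closes. The general pattern can be sketched as below. `LazyCloseable` is an illustrative stand-in written for this comment; it is not Hadoop's `LazyAutoCloseableReference`, and its methods are assumptions rather than that class's API.

```java
import java.util.function.Supplier;

/** Illustrative stand-in for a lazy closeable reference; not the Hadoop class. */
final class LazyCloseable<T extends AutoCloseable> implements AutoCloseable {
  private final Supplier<T> factory;
  private T value;
  private boolean closed;

  LazyCloseable(Supplier<T> factory) {
    this.factory = factory;
  }

  /** Creates the value on first call and returns the same instance afterwards. */
  synchronized T get() {
    if (closed) {
      throw new IllegalStateException("reference is closed");
    }
    if (value == null) {
      value = factory.get();
    }
    return value;
  }

  /** True once the value has been created and not yet closed. */
  synchronized boolean isSet() {
    return value != null;
  }

  /** Closes the value if it was ever created; later calls do nothing. */
  @Override
  public synchronized void close() {
    closed = true;
    if (value == null) {
      return;
    }
    T toClose = value;
    value = null;
    try {
      toClose.close();
    } catch (Exception e) {
      throw new IllegalStateException("close failed", e);
    }
  }
}
```

With a wrapper like this, constructing the reference costs nothing until `get()` is called, and closing a reference that was never used does not create the underlying object just to close it.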

@@ -115,7 +115,7 @@ public class RequestFactoryImpl implements RequestFactory {
/**
* Callback to prepare requests.
*/
- private final PrepareRequest requestPreparer;
+ private PrepareRequest requestPreparer;
Contributor

this doesn't need to be non-final any more, I shall fix in my PR

Contributor

@steveloughran steveloughran left a comment

+1 pending:

  • those little nits
  • my PR in
  • new release of your library (which I've just been looking at...may need a bit of resilience there, especially to premature -1 calls).

@ahmarsuhail
Contributor Author

nice, thanks for the review!

What do you mean by premature -1 calls?

@steveloughran
Contributor

Sometimes a read can return -1 due to network errors, not EOF. In that situation (look at read()) we abort the stream so it doesn't go back into the pool, then ask for a new one. Apparently, before the abort() was added, you could get back the same stream again, even though it was now failing. Inevitably, this is a consequence of the stream's long retention of the same connection; if it returned connections after 60s this would be less likely.
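
A minimal sketch of that kind of resilience, assuming the reader knows the object's content length up front: a -1 seen before that length is treated as a failed connection, the stream is discarded, and a new one is opened at the current offset. `PrematureEofGuard`, `RangeOpener`, and the reopen limit are hypothetical names invented for this example, and `close()` stands in for the `abort()` that S3A performs on its HTTP stream.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the S3A or AAL read path. */
final class PrematureEofGuard {

  /** Hypothetical opener: returns a stream positioned at the given offset. */
  interface RangeOpener {
    InputStream open(long offset) throws IOException;
  }

  private PrematureEofGuard() {
  }

  /** Reads contentLength bytes, reopening when -1 arrives before the end. */
  static byte[] readFully(RangeOpener opener, int contentLength, int maxReopens)
      throws IOException {
    byte[] out = new byte[contentLength];
    int pos = 0;
    int reopens = 0;
    InputStream in = opener.open(0);
    try {
      while (pos < contentLength) {
        int n = in.read(out, pos, contentLength - pos);
        if (n >= 0) {
          pos += n;
          continue;
        }
        // -1 before contentLength: treat as a broken connection, not EOF.
        in.close(); // stand-in for abort(), so the stream is never reused
        if (++reopens > maxReopens) {
          throw new EOFException("premature end of stream at offset " + pos);
        }
        in = opener.open(pos);
      }
    } finally {
      in.close();
    }
    return out;
  }

  /** Simulates one truncated connection followed by a healthy one. */
  static String demo() {
    byte[] data = "hello world".getBytes(StandardCharsets.US_ASCII);
    int[] opens = {0};
    RangeOpener opener = offset -> {
      int from = (int) offset;
      // The first connection ends after 5 bytes; later ones reach the end.
      int to = (opens[0]++ == 0) ? 5 : data.length;
      return new ByteArrayInputStream(data, from, to - from);
    };
    try {
      byte[] read = readFully(opener, data.length, 2);
      return new String(read, StandardCharsets.US_ASCII);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The reopen limit keeps a persistently failing source from looping forever; once it is exceeded the caller sees an `EOFException` instead of a silently short read.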

@ahmarsuhail ahmarsuhail force-pushed the feature-HADOOP-19363-analytics-accelerator-s3 branch from 99fbdeb to b92a661 Compare February 24, 2025 10:54

4 participants