Sum functionality for Numeric columns in transactions #53
Conversation
@kelindar Hello, I'm working on fixing the data race, but would this functionality be something wanted here in general?

this is a great feature IMO
I've been looking into the test failure in the most recent commit (https://github.com/kelindar/column/runs/6029689995?check_suite_focus=true) and cannot replicate it using Docker to test with Ubuntu 20.04 (the GitHub Actions workflow uses Bullseye). Test Dockerfile:
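The Dockerfile itself isn't preserved in this thread; below is a minimal sketch of what such a test image could look like, assuming Go 1.17 on Ubuntu 20.04 (the versions under discussion here), not the original file.

```dockerfile
# Sketch only: reproduce the CI environment locally. The Go version and
# test flags are assumptions based on this thread, not the original file.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y ca-certificates git wget
RUN wget -q https://go.dev/dl/go1.17.9.linux-amd64.tar.gz \
    && tar -C /usr/local -xzf go1.17.9.linux-amd64.tar.gz
ENV PATH="/usr/local/go/bin:${PATH}"
WORKDIR /app
COPY . .
CMD ["go", "test", "-race", "./..."]
```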
However, I did notice that the GitHub Actions are currently being run with Go 1.16 (https://github.com/kelindar/column/blob/main/.github/workflows/test.yml#L14). Should this be updated, given that go.mod indicates 1.17? I can't say this difference itself made the test non-deterministic, but it is something to note. If anyone would like to take a look themselves, I'd greatly appreciate it.
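For reference, the change being suggested would look roughly like this, assuming the workflow uses actions/setup-go (the actual file contents aren't shown in this thread):

```yaml
# .github/workflows/test.yml (sketch): align the CI Go version with go.mod
- uses: actions/setup-go@v2
  with:
    go-version: "1.17"  # was 1.16
```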
@Dreeseaw, first of all, thanks for the contribution. However, this one is a bit tricky: I need to better understand the use-case and the intent. The approach proposed seems to make the code more concise, but not necessarily faster. Currently, we can sum all of the cells of a column in the following manner:

```go
var sum float64
players.Query(func(txn *Txn) error {
	balance := txn.Float64("balance")
	txn.Range(func(idx uint32) {
		v, _ := balance.Get()
		sum += v
	})
	return nil
})
```

Now, there are a few problems with this code:

1. It is rather verbose for such a simple aggregation.
2. It cannot easily be parallelised.
3. It reads one cell at a time, so it cannot take advantage of SIMD vectorisation.
What this PR is proposing would solve problem (1) but not (2) or (3), which could potentially be added in future releases:

```go
players.Query(func(txn *Txn) error {
	sum := txn.SumFloat64("balance")
	return nil
})
```

There are a couple of problems with this approach as well, chief among them that it adds a type-specific method on the transaction itself for every numeric column type.
Moreover, once we open this pandora's box, we'll also need to implement other aggregation functions such as average, minimum, maximum and count. Now, I do agree that having aggregation functions is useful in the library; we just need to think a bit more about the right way of adding them. What we could do instead is add these functions onto the column readers themselves, e.g. a Sum() method on the float64 reader. Ideally, our API should look relatively simple and in line with the current design:

```go
players.Query(func(txn *Txn) error {
	sum := txn.Float64("balance").Sum()
	return nil
})
```

This would solve problem (1) by making the code more concise. We also have the ability to solve (3), given that these readers have access to the raw data of the column; hence we can use SIMD vectorisation to optimise certain aggregations such as sums.

Now, for the parallelisation issue (2), it's quite tricky and I don't know how to properly solve it in the API design at the moment. I do think that if we vectorise these, there will not be a major need for parallelisation unless you operate on very large arrays. We could consider introducing something for this in a future release.
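For illustration, a hedged sketch of why reader-level access helps; the chunk layout and helper name below are assumptions, not the library's actual internals. Summing a raw slice in a tight loop lets the compiler (or hand-written SIMD) process many cells per iteration, unlike a per-row callback:

```go
// Illustrative only: sum a raw column chunk directly rather than
// invoking a callback per row as txn.Range does. A plain loop over a
// contiguous slice is amenable to auto-vectorisation.
func sumChunk(chunk []float64) float64 {
	var sum float64
	for _, v := range chunk {
		sum += v
	}
	return sum
}
```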
@kelindar Thanks for the response! I really appreciate hearing from you. I agree that my original implementation isn't that well thought out. I've been playing around with an API structure more in line with what already exists, and hit a little snag trying to implement the functionality as you've defined above, which would require a Sum on the number readers (column_generate.go).

The only existing functionality similar to aggregations is filtering (accessing an entire column with respect to the txn's index). To try and provide a similar-feeling API to the one you've shown above, I think adding two new generic functions is the best way to provide the functionality under the current API structure, sketched below.
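The original snippet from this comment isn't preserved; what follows is a minimal sketch of the idea under stated assumptions. The Numeric constraint, the name sumOver, and the values accessor are illustrative, not the package's actual API.

```go
// Hypothetical sketch: one of the two generic helpers. It sums a numeric
// column over the transaction's current index via the existing Range.
type Numeric interface {
	~int64 | ~uint64 | ~float64
}

func sumOver[T Numeric](txn *Txn, values func(idx uint32) (T, bool)) (sum T) {
	txn.Range(func(idx uint32) {
		if v, ok := values(idx); ok {
			sum += v
		}
	})
	return sum
}
```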
Let me know what you think of this approach. Also, I'm having a little trouble using genny to generate new column_number.go files for testing. I understand the purpose of the tool, but how exactly do you run it?
I finally got some time to clean up this mess: no more genny, the code now uses Go generics.
Commits:
- Benchmark sum performance
- clean up some tabs and comments
- Revert "clean up some tabs and comments" (reverts commit 07f51e0)
- protect column index access per-row
- clean up int64 func and add access protection to others
- add error checking for sums, formatting
- fix whitespace that magically turns into tabs
@kelindar I love the upgrades to generics & exposing the whole txn for column readers! Since you got around to adding the reader-based Sum(), I wrote a transaction test for it and noticed that two identical sums inside the same transaction can return different results while a concurrent writer is active. To combat this, I think holding a read lock on each shard accessed in the transaction until it completes could work. Let me know what you have in mind for fixing this. Also, thanks again for the generics work to make this possible!
@Dreeseaw interesting point. So what you're seeing in this transaction test is actually expected: it's a non-repeatable read anomaly, as two sums in the same transaction can return different results. Basically, the current isolation level provided by the library is essentially read committed. A full read-lock on the transaction would be inefficient, as we would need to know in advance which shards to lock. This specific scenario can be avoided by implementing a repeatable read or snapshot isolation level, but that would similarly have a negative impact on performance and would also require a rather major re-design: each column would need to do something like a copy-on-write of each shard, and we'd need to maintain a linked list of modified shards for each transaction.
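To make the anomaly concrete, a small sketch (assuming the reader-based Sum() discussed above; this is not code from the PR):

```go
// Illustrative only: under read-committed isolation, a concurrent writer
// may commit between the two calls, so the two sums can legitimately
// disagree within a single transaction.
players.Query(func(txn *Txn) error {
	balance := txn.Float64("balance")
	first := balance.Sum()
	// another transaction commits an update to "balance" here
	second := balance.Sum()
	if first != second {
		// non-repeatable read: same column, same transaction, different result
	}
	return nil
})
```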
Got it, thanks for the explanation. I've removed the unstable test and left the added documentation and benchmark for you to run locally on your next PR.
This PR introduces a simple sum aggregation for transactions. It adds three functions, txn.SumInt64(), txn.SumUint64() & txn.SumFloat64(), that sum the given numeric column with respect to the txn's current index, via txn_lock's rangeRead(). Since there are no prior built-in aggregation functions in the package, I understand if the goal is to let the user handle that functionality via a txn.Range() call. However, I believe more work could be done to further parallelize this operation (read & add all chunks concurrently?), as sketched below. Due to the data access this requires, accounting for simple aggregations in column itself seems like the best place.
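A hedged sketch of that chunk-parallel idea (the [][]float64 chunk layout and the helper name are assumptions for illustration, not the package's internals):

```go
import "sync"

// parallelSum fans out one goroutine per chunk, sums each chunk into its
// own slot to avoid contention, then combines the partial results.
func parallelSum(chunks [][]float64) float64 {
	partial := make([]float64, len(chunks))
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c []float64) {
			defer wg.Done()
			var s float64
			for _, v := range c {
				s += v
			}
			partial[i] = s
		}(i, c)
	}
	wg.Wait()
	var total float64
	for _, s := range partial {
		total += s
	}
	return total
}
```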