How to gain access to private byte[] _sequenceData

Jan 21, 2014 at 1:58 AM
The documentation for the Sequence class says:
/// For users who wish to get at the underlying data directly, Sequence provides
/// a means to do this as well. This may be useful for those writing algorithms
/// against the sequence where performance is especially important. For these
/// advanced users access is provided to the encoding classes associated with the
/// sequence.
How does this work?

I see internal byte[] GetInternalArray(), but it seems to be of no use outside Bio.dll.

Thanks,

Mark
Developer
Jan 21, 2014 at 2:11 AM
Hi Mark,

Yes, the internal keyword means that it is only accessible to code compiled in the same DLL. To access the byte array you have a couple of options, shown below.
Sequence s = new Sequence(DnaAlphabet.Instance, "ACGTCTG");
byte v = s[1];            // get the second element
byte[] arr = s.ToArray(); // create a copy to work with
The first option is just to treat the sequence directly as an array, as it exposes an indexer (albeit read-only).

The second option is to create a copy using the ToArray extension method, which I often do if I want to manipulate a copy of the data.

-N
Jan 21, 2014 at 3:59 AM
Thanks for the response.

I should have been more explicit. What I need is a way to write directly to byte[] underlying the Sequence object.

I have been copying the data from Sequence objects into byte [], modifying it, and then converting it back to a new Sequence object again. But, this is computationally inefficient and makes the code unnecessarily complicated.

I understand there are reasons for keeping Sequence read-only. But, it would be very helpful to be able to derive a class that is not read-only from Sequence.

What I am considering is:

1) Adding Sequence.cs to my project and changing the namespace. Remove "sealed". Change GetInternalArray() from internal to protected and _sequenceData from private to protected.
2) Derive class EditableSequence from Sequence. Add "set" to "new public byte this[long index]" to provide write functionality.
3) Replace Sequence with EditableSequence where necessary.
4) Eventually, I would like to override GetSubSequence() with a method that constructs a new EditableSequence object in which _sequenceData just points to the appropriate range of the array in the original object. That would obviously require some extra code to keep track of dependencies for disposal, but seems to be an opportunity to increase execution efficiency considerably.

Is there a better way to accomplish what I am trying to do? Is there a good reason to endure the headaches of copying back and forth to byte []?

Thanks,

Mark
Developer
Jan 21, 2014 at 4:53 AM
Hi Mark,

Yes, the need for an editable sequence class has come up a few times before, and perhaps it is time to implement that change. I vaguely remember talking with someone a while ago about this (maybe Mark or Jim?) and think we opted against it, though I can't really remember the reasons (maybe we just had other things to do). I can't recall what the downside was, but perhaps there was someplace where ISequence was built to assume read-only. Does anyone remember this?

Barring any conflicts, what you are proposing does sound reasonable, though I am not quite sure I understand what you are trying to do. Having a set of sequences (with different ranges) backed by the same array could be useful (note that this is how the DerivedSequence class works), but perhaps you are thinking of a fancier copy-on-write system for having multiple sequences share the same data? (Disposal is taken care of automatically by the GC, so it won't be an issue.)

One thing I do remember was finding that copying the array back and forth actually had a pretty negligible effect on performance (at least for what I did, which was mostly small sequences). Definitely be sure to pass in an alphabet and a flag to skip validation when making sequences, though, as the validation does have a reasonably significant overhead.
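A rough sketch of that copy, edit, and rebuild round trip with validation switched off on the way back. It assumes the Sequence(IAlphabet, byte[], bool) constructor overload in Bio.dll, with the last argument acting as the validation flag; check the overloads in your version. RoundTripSketch and ReplaceAt are made-up names for illustration.

```csharp
using System;
using System.Linq;
using Bio;

static class RoundTripSketch
{
    // Copies the symbols, overwrites one position, and wraps the result in a
    // new Sequence without re-validating against the alphabet.
    public static Sequence ReplaceAt(Sequence source, long index, byte symbol)
    {
        byte[] copy = source.ToArray();   // LINQ copy of the symbols
        copy[index] = symbol;

        // Assumed overload: (alphabet, values, validate). Passing false skips
        // validation, so the caller must keep the symbols legal.
        return new Sequence(source.Alphabet, copy, false);
    }
}
```

Called as RoundTripSketch.ReplaceAt(s, 1, (byte)'G'), this leaves s untouched and returns a new object. The sketch does not carry over the ID or any other metadata from the source sequence.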

-N
Coordinator
Jan 21, 2014 at 10:31 AM
Hey Mark & Nigel,

Yes, it has come up before. I think the main reason it's not allowed is integrity. Sequence is intended to be read-only - there are no operations on it which edit the sequence itself. In v1, it had operators to allow edits and we found that performance suffered. Hence the sealed keyword (which improves performance significantly for virtual methods in certain cases) and the removal of read/write methods. I'm not necessarily opposed to adding a method to get access to the internal array, but it causes issues down the road, such as a loss of validation: once you provide direct access to the bytes, the sequence can no longer guarantee that it is valid according to the alphabet. Today, validation is generally guaranteed.

My opinion today would be: if you need an editable sequence, either roll your own ISequence implementation to provide for that (super easy, and that's what most do), or use another implementation such as SparseSequence (which is slower due to the Dictionary approach for storage). Almost everything in the framework relies on ISequence rather than the actual Sequence implementation, so you would still have access to everything.
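To give a feel for the roll-your-own approach, here is a minimal sketch. So that it compiles without a reference to Bio.dll, it uses a hypothetical stand-in interface (IReadOnlySymbols) instead of the real ISequence, which has more members (alphabet, ID, sub-sequence and complement methods, and so on) that a real implementation would also need to provide. EditableSequence and its members are illustrative names, not part of the library.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for the read-only surface of Bio.ISequence.
public interface IReadOnlySymbols : IEnumerable<byte>
{
    long Count { get; }
    byte this[long index] { get; }
}

// Hypothetical editable sequence that owns its byte[] and exposes it.
public sealed class EditableSequence : IReadOnlySymbols
{
    private readonly byte[] data;

    public EditableSequence(string symbols)
    {
        data = Encoding.ASCII.GetBytes(symbols);
    }

    public long Count { get { return data.Length; } }

    // Read/write indexer; the interface only requires the getter.
    public byte this[long index]
    {
        get { return data[index]; }
        set { data[index] = value; }   // no alphabet validation here
    }

    // Direct access to the backing array for performance-critical code.
    public byte[] GetInternalArray() { return data; }

    public IEnumerator<byte> GetEnumerator()
    {
        return ((IEnumerable<byte>)data).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Code that only reads can take the interface type, while code that edits works with EditableSequence directly. The trade-off is the one described above: nothing in this class checks that written bytes are valid for an alphabet.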

By the way, the reason I added the internal accessor was for the internal global alignment algorithms. Since they are accessing the data overall but not editing it, it made sense for them to have direct access to the data to avoid the copy and memory pressure that produced.

mark
Marked as answer by RMarkT on 1/23/2014 at 9:23 AM
Jan 21, 2014 at 12:36 PM
Nigel and Mark:

Thanks for the prompt and useful responses.

The strategy I outlined above did not work as it left me with two nearly identical classes, Bio.Sequence and MyProject.Sequence, and I could not cast from one to the other or from EditableSequence to Bio.Sequence, which was what I had intended.

What I understand you to have said is that I could copy Sequence.cs into my project, rename Bio.Sequence to MyProject.EditableSequence, leave it as an implementation of ISequence instead of trying to inherit it from another class, and make a few minor changes to make the sequence editable and expose the underlying byte[].

I am still trying to wrap my head around the concept of interfaces (and many other aspects of C#), but what I think you are saying is that I should be able to leave most of my current objects as Sequence, make the ones I need to modify EditableSequence, and I should still be able to use EditableSequence, SparseSequence, QualitativeSequence, or DerivedSequence in most places where I have been using Sequence.

Am I on the right track?

Thanks,

Mark
Coordinator
Jan 22, 2014 at 2:37 PM
Hi Mark,

Yep. Try to pass sequences around as ISequence (i.e. make inputs to methods, etc. all of type ISequence) so you can then use any sequence-type you need internally in your program to generate or edit the sequence.
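As a small illustration, a helper written against ISequence works with whichever implementation the caller happens to hold. This is only a sketch: SequenceStats and CountGc are made-up names, and it relies on ISequence being enumerable as bytes.

```csharp
using System;
using Bio;

static class SequenceStats
{
    // Accepts any ISequence implementation, so callers can pass Sequence,
    // SparseSequence, DerivedSequence, or their own editable class.
    public static long CountGc(ISequence sequence)
    {
        long count = 0;
        foreach (byte symbol in sequence)   // symbols are enumerated as bytes
        {
            if (symbol == (byte)'G' || symbol == (byte)'C' ||
                symbol == (byte)'g' || symbol == (byte)'c')
            {
                count++;
            }
        }
        return count;
    }
}
```

For example, SequenceStats.CountGc(new Sequence(DnaAlphabet.Instance, "ACGTCTG")) and a call that passes a custom editable implementation go through the same code path.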

mark

Jan 23, 2014 at 5:26 PM
Mark:

Thanks for the help.

What I described in my previous post seems to be working well and to have solved the fundamental problem I was having.

Regards,

Mark