/* RIV stands for Random Index Vector, referring to the method of generating
 * the basic vectors that correspond to each word. each word has an
 * algorithmically generated vector which represents it in this mathematical
 * model, such that a word will produce the same vector each time it is
 * encountered [1]. this base vector will be referred to as an L1 vector
 * or a barcode vector.
 *
 * by summing these vectors, we can get a mathematical representation of
 * a set of text. this summed vector will be referred to as an L2 vector
 * or aggregate vector. in its simplest implementation, an L2 vector
 * representation of a document contains a model of the contents of the
 * document, enabling us to compare the direction and magnitude of document
 * vectors to understand their relationships to each other.
 *
 * but the system we are really interested in is the ability to form
 * context vectors.
 * a context vector is the sum of all (L1?) vectors that the word
 * has been encountered in context with [2]. from these context vectors,
 * certain patterns and relationships between words should emerge.
 * what patterns? that is the key question we will try to answer.
 *
 * [1] a word produces the same vector each time it is encountered only
 * if the environment is the same, i.e. the RIVs have the same dimensionality
 * and the same nonzero count. comparing vectors produced in different
 * environments yields meaningless drivel and should be avoided.
 *
 * [2] what exactly "context" means remains a major stumbling point.
 * paragraphs? sentences? some potential analyses would expect a fixed-size
 * context (the nearest 10 words?) in order to be sensible, but it may be
 * that some other definition of context is the most valid for this model.
 * we will have to find out.
 *
 * some notes:
 *
 * - sparseRIV vs. denseRIV (sparse vector vs. dense vector):
 *   the two primary data structures we will use to analyze these vectors.
 *   each vector type is packed with some metadata
 *   (name, magnitude, frequency, flags).
 *
 * - denseRIV is a standard vector representation:
 *   each array index corresponds to a dimension,
 *   each value corresponds to a measurement in that dimension.
 *
 * - sparseRIV is a vector representation optimized for largely empty vectors:
 *   each data point is a location/value pair, where the
 *   location represents the array index and the
 *   value represents the value at that array index.
 *
 * if we have a sparsely populated dense vector (mostly 0s) such as:
 *
 * |0|0|5|0|0|0|0|0|4|0|
 *
 * there are only 2 values in a ten-element array. this could, instead,
 * be represented as
 *
 * |2|8|   array indexes
 * |5|4|   array values
 * |2|     record of size
 *
 * and so, a 10-element vector has been represented in only 5 integers.
 *
 * this is important for memory use, of course, but also for rapid
 * calculations. if we have two vectors
 *
 * |0|0|5|0|0|0|0|0|4|0|
 * |0|0|0|0|0|0|7|0|3|-2|
 *
 * and we wish to perform the dot product, this will take 10 steps,
 * 9 of which are either 0*0 = 0 or 0*x = 0.
 * if we instead have these represented as sparse vectors
 *
 * |2|8|
 * |5|4|
 * |2|
 *
 * |6|8|9|
 * |7|3|-2|
 * |3|
 *
 * we only need to search for matching location values.
 * or, better yet, if we use a hybrid analysis:
 *
 * |0|0|5|0|0|0|0|0|4|0|
 *  ___________/__/_/
 *  /     /   /
 * |6|8|9|
 * |7|3|-2|
 * |3|
 *
 * we can simply access the dense vector by the indexes held in the sparse
 * vector, reducing this operation to only 3 steps (a sketch of both
 * structures and this hybrid dot product follows this comment).
 */
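
/* the sketch below is an illustrative standalone example, not the library's
 * own definitions: the struct layouts beyond the metadata fields named above,
 * the macro names, the toy dimensionality, and the helper name
 * dotProductHybrid are assumptions made for demonstration. it shows the dense
 * and sparse layouts described above and the hybrid dot product that touches
 * only the locations recorded in the sparse vector; run on the two example
 * vectors, it prints 12 after 3 multiply-adds instead of 10. */

#include <stdio.h>

#define RIVSIZE 10     /* toy dimensionality to match the examples above */
#define MAXNONZERO 10  /* assumed cap on nonzero entries per sparse vector */

/* dense layout: one slot per dimension */
typedef struct {
	char name[100];      /* metadata */
	double magnitude;
	int frequency;
	int flags;
	int values[RIVSIZE]; /* values[i] is the measurement in dimension i */
} denseRIV;

/* sparse layout: location/value pairs plus a record of size */
typedef struct {
	char name[100];            /* metadata */
	double magnitude;
	int frequency;
	int flags;
	int locations[MAXNONZERO]; /* array indexes of the nonzero entries */
	int values[MAXNONZERO];    /* value held at each of those indexes */
	int count;                 /* record of size: number of nonzero entries */
} sparseRIV;

/* hybrid dot product: index straight into the dense vector at each location
 * the sparse vector records, so only count multiply-adds are performed */
int dotProductHybrid(const denseRIV *dense, const sparseRIV *sparse){
	int dot = 0;
	for(int i = 0; i < sparse->count; i++){
		dot += dense->values[sparse->locations[i]] * sparse->values[i];
	}
	return dot;
}

int main(void){
	/* |0|0|5|0|0|0|0|0|4|0| held dense */
	denseRIV a = {.name = "docA", .values = {0, 0, 5, 0, 0, 0, 0, 0, 4, 0}};

	/* |0|0|0|0|0|0|7|0|3|-2| held sparse: locations 6, 8, 9 */
	sparseRIV b = {.name = "docB",
	               .locations = {6, 8, 9},
	               .values    = {7, 3, -2},
	               .count     = 3};

	/* only index 8 overlaps, giving 4*3 = 12 */
	printf("dot product: %d\n", dotProductHybrid(&a, &b));
	return 0;
}
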
R