/* RIV stands for Random Index Vector, referring to the method of generating
 * the basic vectors that correspond to each word. each word has an
 * algorithmically generated vector which represents it in this mathematical
 * model, such that a word will produce the same vector each time it is
 * encountered [1]. this base vector will be referred to as an L1 vector
 * or a barcode vector.
 *
 * by summing these vectors, we can get a mathematical representation of
 * a set of text. this summed vector will be referred to as an L2 vector
 * or aggregate vector. in its simplest implementation, an L2 vector
 * representation of a document contains a model of the contents of the
 * document, enabling us to compare the direction and magnitude of document
 * vectors to understand their relationships to each other.
 *
 * but the system we are really interested in is the ability to form
 * context vectors.
 * a context vector is the sum of all (L1?) vectors that the word
 * has been encountered in context with [2]. from these context vectors,
 * certain patterns and relationships between words should emerge.
 * what patterns? that is the key question we will try to answer.
 *
 * [1] a word produces the same vector each time it is encountered only
 * if the environment is the same, i.e. the RIVs have the same dimensionality
 * and the same nonzero count. comparing vectors produced in different
 * environments yields meaningless drivel and should be avoided.
 *
 * [2] what exactly "context" means remains a major stumbling point.
 * paragraphs? sentences? some potential analyses would expect a fixed-size
 * context (the nearest 10 words?) in order to be sensible, but it may be
 * that some other definition of context is the most valid for this model.
 * we will have to find out.
 *
 * some notes:
 *
 * - sparseRIV vs. denseRIV (sparse vector vs. dense vector):
 *   the two primary data structures we will use to analyze these vectors.
 *   each vector type is packed with some metadata
 *   (name, magnitude, frequency, flags).
 *
 * - denseRIV is a standard vector representation:
 *   each array index corresponds to a dimension,
 *   each value corresponds to a measurement in that dimension.
 *
 * - sparseRIV is a vector representation optimized for largely empty vectors:
 *   each data point is a location/value pair, where the
 *   location represents the array index and the
 *   value represents the value at that array index.
 *
 * if we have a sparsely populated dense vector (mostly 0s) such as:
 *
 * |0|0|5|0|0|0|0|0|4|0|
 *
 * there are only 2 values in a ten-element array. this could, instead,
 * be represented as
 *
 * |2|8|   array indexes
 * |5|4|   array values
 * |2|     record of size
 *
 * and so, a 10-element vector has been represented in only 5 integers.
 *
 * this is important for memory use, of course, but also for rapid
 * calculations. if we have two vectors
 *
 * |0|0|5|0|0|0|0|0|4|0|
 * |0|0|0|0|0|0|7|0|3|-2|
 *
 * and we wish to perform the dot product, this will take 10 steps,
 * 9 of which are either 0*0 = 0 or 0*x = 0.
 * if we instead have these represented as sparse vectors
 *
 * |2|8|
 * |5|4|
 * |2|
 *
 * |6|8|9|
 * |7|3|-2|
 * |3|
 *
 * we only need to search for matching location values.
 * or, better yet, if we use a hybrid analysis:
 *
 * |0|0|5|0|0|0|0|0|4|0|
 *  ___________/__/_/
 *  /     /   /
 * |6|8|9|
 * |7|3|-2|
 * |3|
 *
 * we can simply access the dense vector by the indexes held in the sparse
 * vector, reducing this operation to only 3 steps (a sketch of both
 * structures and this hybrid dot product follows this comment).
 */
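
/* the sketch below is an illustrative standalone example, not the library's
 * own definitions: the struct layouts beyond the metadata fields named above,
 * the macro names, the toy dimensionality, and the helper name
 * dotProductHybrid are assumptions made for demonstration. it shows the dense
 * and sparse layouts described above and the hybrid dot product that touches
 * only the locations recorded in the sparse vector; run on the two example
 * vectors, it prints 12 after 3 multiply-adds instead of 10. */

#include <stdio.h>

#define RIVSIZE 10     /* toy dimensionality to match the examples above */
#define MAXNONZERO 10  /* assumed cap on nonzero entries per sparse vector */

/* dense layout: one slot per dimension */
typedef struct {
	char name[100];      /* metadata */
	double magnitude;
	int frequency;
	int flags;
	int values[RIVSIZE]; /* values[i] is the measurement in dimension i */
} denseRIV;

/* sparse layout: location/value pairs plus a record of size */
typedef struct {
	char name[100];            /* metadata */
	double magnitude;
	int frequency;
	int flags;
	int locations[MAXNONZERO]; /* array indexes of the nonzero entries */
	int values[MAXNONZERO];    /* value held at each of those indexes */
	int count;                 /* record of size: number of nonzero entries */
} sparseRIV;

/* hybrid dot product: index straight into the dense vector at each location
 * the sparse vector records, so only count multiply-adds are performed */
int dotProductHybrid(const denseRIV *dense, const sparseRIV *sparse){
	int dot = 0;
	for(int i = 0; i < sparse->count; i++){
		dot += dense->values[sparse->locations[i]] * sparse->values[i];
	}
	return dot;
}

int main(void){
	/* |0|0|5|0|0|0|0|0|4|0| held dense */
	denseRIV a = {.name = "docA", .values = {0, 0, 5, 0, 0, 0, 0, 0, 4, 0}};

	/* |0|0|0|0|0|0|7|0|3|-2| held sparse: locations 6, 8, 9 */
	sparseRIV b = {.name = "docB",
	               .locations = {6, 8, 9},
	               .values    = {7, 3, -2},
	               .count     = 3};

	/* only index 8 overlaps, giving 4*3 = 12 */
	printf("dot product: %d\n", dotProductHybrid(&a, &b));
	return 0;
}
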
R