2005-09-27, 10:39  #1 
Jun 2003
Russia, Novosibirsk
2×107 Posts 
Best way to store huge data
For example, the list of expos with factors, or the list of expos with bit levels. Any ideas?

2005-09-27, 13:30  #2 
May 2005
Naperville, IL, USA
304_{8} Posts 
HiddenWarrior,
What do you mean by "huge"? Are you talking about lots of individual data items, or about each data item requiring many bits of storage for its representation? What do you mean by "best"? Do you have some particular purpose for the data? Are you archiving it, or do you need it available for quick access by some process? I don't mean to give you the third degree here, but without some additional information it would be very difficult to provide a meaningful answer to your question.

John 
2005-09-27, 17:53  #3 
"Richard B. Woods"
Aug 2002
Wisconsin USA
2^{2}·3·641 Posts 
HiddenWarrior,
... and please give rough numerical estimates of size of what you're looking at. For instance, 10^6 items of 10^3 bytes each, or 10^10 items of 10^6 bytes each, or what? Recommendations may vary accordingly. :) 
2005-10-04, 09:53  #4 
Jun 2003
Russia, Novosibirsk
2·107 Posts 
I mean the best (smallest in size) way to store strings like exponent,factor.
For example, here's one way. To store 29,233 (29 = expo, 233 = divider) we do the following:
1. Make the first byte show where the ',' symbol is: 3.
2. Remove the ',' from the string, leaving 29233.
3. Convert it to base 256 (divide by 256 until the quotient is < 256). Result: 114, 49.
4. Now we have the string 3,114,49; convert it to ASCII symbols to get a 3-byte string.
Finally, from the original 6 bytes we get only 3 bytes (from which we can restore the original)! 50% compression for that expo,factor pair. Is that method the best? What mechanism does the human-readable format use? 
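The steps above can be sketched in a few lines of Python. This is my own illustration of the idea, not HiddenWarrior's actual code; the function names are made up:

```python
# Pack "expo,factor" into a lead byte (comma position) plus base-256 digits.
def pack(expo: int, factor: int) -> bytes:
    digits = str(expo) + str(factor)        # "29" + "233" -> "29233"
    comma_pos = len(str(expo)) + 1          # 1-based position of the ','
    n = int(digits)                         # treat the digits as one number
    body = []
    while n >= 256:                         # repeated division by 256
        n, r = divmod(n, 256)
        body.append(r)
    body.append(n)
    return bytes([comma_pos] + body[::-1])  # lead byte + base-256 digits

def unpack(data: bytes):
    comma_pos = data[0]
    n = 0
    for b in data[1:]:                      # rebuild the decimal number
        n = n * 256 + b
    digits = str(n)
    return int(digits[:comma_pos - 1]), int(digits[comma_pos - 1:])
```

For the 29,233 example, `pack(29, 233)` gives the 3 bytes (3, 114, 49), and `unpack` recovers (29, 233).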
2005-10-04, 10:23  #5 
Jun 2003
Russia, Novosibirsk
2·107 Posts 
OK, going further, we can extend the above algo:
Suppose we have a string holding info about expo (e), status (s), value (v): e...,s,v...

e... is at most 99 bytes long (the Billion Digits project works only with 10-byte expos, so we have a huge upper bound!). s is 1 byte; its values are: 0 = untested expo (v... holds the current factorization level), 1 = Mersenne (v... holds the Mersenne position in the official tables, 1..42 for now :), 2 = has divider (v... holds the divider).

So we refine the algo:
1. The first byte is the position of the first ',' in the string, i.e. len(e...)+1; then we add s*100, so: 1 byte = len(e...)+1+s*100.
2. We remove the ',' from the string and get: e...sv...
3. As s is already stored in the first byte, we remove it: e...v...
4. Convert e...v... from base 10 to base 256 by dividing.
5. Convert all resulting bytes to ASCII chars; that is the result!

Example (first Mersenne + first with a divider + first unfactored):
2,1,1
11,2,23
1061,0,60
Computing the lead bytes: 1+1+1*100 = 102; 2+1+2*100 = 203; 4+1+0*100 = 5.
Removing ',': 211; 11223; 1061060.
Removing s: 21; 1123; 106160.
Base 256: 21; 4,99; 1,158,176.
Converting to ASCII and splitting by any symbol: 102,21 ^ 203,4,99 ^ 5,1,158,176 (sorry, I can't show the ASCII symbols here :)
Finally, we store the original info in 11 bytes (it was 21 bytes without CRLFs)! And we can still compress the resulting string further! 
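Here is a small sketch of this extended algorithm (my own hypothetical code, not the author's). Note that the lead-byte formula len(e)+1+s*100 gives 5 for the 1061,0,60 record; also note this sketch assumes v has no leading zeros, since the digits of e and v are concatenated:

```python
# Encode (e, s, v): lead byte = len(e)+1 + 100*s, then the decimal digits
# of e followed by v, re-encoded in base 256.
def encode(e: int, s: int, v: int) -> bytes:
    lead = len(str(e)) + 1 + 100 * s
    n = int(str(e) + str(v))
    body = b""
    while True:
        n, r = divmod(n, 256)
        body = bytes([r]) + body
        if n == 0:
            break
    return bytes([lead]) + body

def decode(data: bytes):
    lead = data[0]
    s, pos = divmod(lead, 100)       # recover status and comma position
    n = 0
    for b in data[1:]:
        n = n * 256 + b
    digits = str(n)
    return int(digits[:pos - 1]), s, int(digits[pos - 1:])
```

Running the three examples reproduces the bytes worked out by hand: (102, 21), (203, 4, 99), and (5, 1, 158, 176).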
2005-10-04, 13:37  #6 
May 2005
Naperville, IL, USA
2^{2}×7^{2} Posts 
Necessary to restore?
HiddenWarrior, You have done a great job of compressing the data if that's all you want to do (for whatever reason). However, if you want to reverse the process and get the original data back from your compressed version, I'm afraid you have discarded too much information. In your discussion, you compute a representation for the position of the commas in your original data, but I don't think you store that number with the output. Also, you could look more closely at the definition of s. You don't need an entire byte to hold all of its values; it can be represented in its entirety by two bits.
As to your assertion that you can compress the resulting strings (last line of your post): you can try to compress the data further, but I would not expect much additional compression, because the input data would appear essentially random. 2,1,1 probably would not compress very much, because the overhead to store the positions of the commas would be as large as (or larger than) the original (plaintext) data.

I would represent the data this way: one lead byte where the two most significant bits represent the status code:
00 = unfactored (sieving)
01 = unfactored (deeper factoring)
10 = Mersenne prime
11 = factored

The rest of the first byte would depend on the status code. If the number is not factored (sieving), the remaining six bits would represent the factor level less some minimum value. For example, if you chose not to represent any factorization level less than 32, represent 32 as 00 0000, up to 95 as 11 1111. If the number was factored to 96 bits or more, you would represent the number of bytes required by the factorization in the remaining six bits of the lead byte, and then represent the factor level as a string of bytes. If the number were a Mersenne prime, then its position in the prime table would go into the next six bits. If the number were factored, then you would indicate the number of bytes required by the factor in the last six bits. If necessary, the factor level or the factor would appear next in binary.

Finally, I would represent the exponent in binary. If most of your exponents needed only four bytes, I would take advantage of that and use leading zeros to create four-byte values. If you have a mixture, you might want to define some bits in the first byte to tell you how many bytes are needed to represent the number, followed by the number.

I hope that this is helpful. Is this only an academic exercise, or does it have some application? I trust I have not just done your homework for you. 
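The lead-byte layout described above can be sketched like this (a minimal illustration with my own names and constants; the 32..95 window is the example range from the post):

```python
# Two status bits in bits 6-7, six payload bits in bits 0-5.
STATUS_SIEVING = 0b00   # unfactored (sieving); payload = level - minimum
STATUS_DEEP    = 0b01   # unfactored (deeper factoring)
STATUS_PRIME   = 0b10   # Mersenne prime; payload = position in the table
STATUS_FACTOR  = 0b11   # factored; payload = byte length of the factor

FACTOR_LEVEL_BASE = 32  # levels 32..95 map onto payload values 0..63

def make_lead_byte(status: int, payload: int) -> int:
    assert 0 <= status <= 3 and 0 <= payload <= 63
    return (status << 6) | payload

def split_lead_byte(b: int):
    return b >> 6, b & 0x3F   # (status, payload)
```

For example, a number sieved to 62 bits gets lead byte `make_lead_byte(STATUS_SIEVING, 62 - FACTOR_LEVEL_BASE)`, i.e. 30, and splitting it recovers status 0 and level 62.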
2005-10-04, 14:21  #7 
May 2005
Naperville, IL, USA
C4_{16} Posts 
I thought about this some more (sometimes I do my best thinking in the shower) and came up with another representation. I think this one will be bigger for small data but smaller for big data. You would need to try encoding a subset of your data and decide which works better for you.
I am numbering the bits by the power of 2 each one represents: 76543210.
Byte 1, bits 6 and 7: 00 = unfactored, 01 = Mersenne, 10 = factored.
Byte 1, bits 4 and 5: if unfactored or Mersenne, the number of bytes for the factor level or table entry, less 1; otherwise set to zero and ignored.
Byte 1, bits 0 through 3: the number of bytes for the exponent.

If factored, Byte 2 would indicate the number of bytes for the factor, followed by the factor. Otherwise, Byte 2 would start the representation of the factor level, followed by the representation of the exponent.

Here is an example. I just finished factoring M35665369 to 62 bits. I would store this as follows:
Byte 1: 00 00 0100 (unfactored, 1 byte for factor level, 4 bytes for exponent)
Byte 2: 00111110 (62 in binary)
Bytes 3-6: 00000010 00100000 00110101 11011001 (35665369 in binary)
This would handle a Mersenne exponent up to 16 bytes long and a factor up to 255 bytes long.

The bottom line is that the best data structure for compression really depends on the data. I think what I have presented here would do quite well for representing numbers while they are being factored. However, it does not do very well with exponents that have been factored. Rather than trying to shoehorn both types of data into one representation, maybe you would do better by segregating the factored numbers and representing them separately. Then you could better apportion the bits in Byte 1 between the maximum number of bytes for the factors you want to store and the number of bytes of the exponent. Remember, you can "stretch the bits" by defining a minimum factor or exponent size and storing only the offset (for example, if the minimum size is 2, you can store 2 [as zero] through 17 [as fifteen] in four bits). Again, I hope that this helps. 
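The unfactored case of this Byte 1 layout can be sketched as follows (my own hypothetical helper, handling only the unfactored/Mersenne branch where the level or table entry precedes the exponent):

```python
# Byte 1: bits 6-7 = status, bits 4-5 = (bytes for level) - 1,
# bits 0-3 = bytes for exponent. Then the level bytes, then the exponent.
def encode_record(status: int, level: int, exponent: int) -> bytes:
    level_bytes = level.to_bytes(max(1, (level.bit_length() + 7) // 8), "big")
    expo_bytes = exponent.to_bytes((exponent.bit_length() + 7) // 8, "big")
    byte1 = (status << 6) | ((len(level_bytes) - 1) << 4) | len(expo_bytes)
    return bytes([byte1]) + level_bytes + expo_bytes

# M35665369 factored to 62 bits, status 00 (unfactored):
record = encode_record(0b00, 62, 35665369)
```

This reproduces the worked example byte for byte: 0x04 (00 00 0100), 0x3E (62), then 0x02 0x20 0x35 0xD9 (35665369), six bytes in total.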
2005-10-05, 02:48  #8 
Jun 2003
Russia, Novosibirsk
2×107 Posts 
OK, thanks for the discussion! I'll try to implement your suggestions today and see if they help.
Right now I have an advanced implementation of my method. I'll optimize it a bit and upload it to the MP site for everyone to try. 
2005-10-05, 03:11  #9 
Jun 2003
Russia, Novosibirsk
326_{8} Posts 
Now, the final RC technical task: 
We have a database that stores data for the GIMPS project. It contains: unfactored expos, Mersennes, dividers, LLT- and DLLT-tested expos.

The limits: exponents are 2 to 9,999,999,999 (I think there is no need to store larger expos for now :)

Statuses for expos:
0 = unfactored (having some factoring bit level)
1 = Mersenne (its number in the Mersenne table)
2 = has divider
3 = DLLT (having some factoring bit level)
4 = LLT (having some factoring bit level)

We need to compress it. 
2005-10-05, 03:56  #10 
Jun 2003
Russia, Novosibirsk
2·107 Posts 
I found that by combining bit compression and byte compression, we can get extra compression. The algo, known as KBBC (Knoppix Bit/Byte Compression), is a development of VSC.Group Studio 2005. Sorry, but no open info yet. Here is the implementation for Mersennes:
1. The first byte is divided into a High nibble (4 bits) and a Low nibble (4 bits). The Low nibble contains:
0000 = Mersenne (1-16); in this case High contains the number 1-16 (enumerating position)
0001 = Mersenne (17-32); in this case High contains the number 17-32 (enumerating position)
0010 = Mersenne (33-48); in this case High contains the number 33-48 (enumerating position). That is the current 42 plus 6 of headroom!
These first 3 Low-nibble values contain all the info needed to store data for Mersennes 1-48. As we know all 42 of them, no more data needs to be stored by the compression algo. Last fiddled with by HiddenWarrior on 2005-10-05 at 03:57 
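One way this nibble scheme could work in code (a sketch under my own assumed encoding: the Low nibble selects a 16-entry block, the High nibble the position inside it, so any Mersenne position 1..48 fits in one byte):

```python
# Pack a Mersenne table position (1..48) into one byte:
# low nibble = block index, high nibble = offset within the block.
def pack_mersenne(position: int) -> int:
    assert 1 <= position <= 48
    block, offset = divmod(position - 1, 16)  # block 0..2, offset 0..15
    return (offset << 4) | block

def unpack_mersenne(byte: int) -> int:
    block = byte & 0x0F
    offset = byte >> 4
    return block * 16 + offset + 1
```

Since all Mersenne primes known at the time (1..42) fit in this single byte, no further payload is needed for them.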
2005-10-05, 05:46  #11 
May 2005
Naperville, IL, USA
304_{8} Posts 
HiddenWarrior, I don't want to ignore your last two posts. However, I have been thinking more about the scheme I proposed earlier and just want to flesh it out a bit more. It might not be so necessary to segregate the factored from the unfactored numbers. I believe I read recently that one of the "relatively small" Mersenne numbers has a smallest factor of more than 500 bits. I was somewhat daunted by the prospect of needing to represent this number. But, I think I may have another idea that allows that to be done.
I would store the Mersenne exponent first, as a field five bytes long (2^40 - 1 = 1,099,511,627,775, which is larger than the 9,999,999,999 maximum you stated; four bytes is too small, and I think you would sacrifice too much performance trying to use only four and one-half bytes for this purpose). The next byte would store the number's status in its top three bits (you defined a total of five status values to store).

If the number is a Mersenne prime, you can use the remaining five bits of the status byte to represent its place in the Mersenne pantheon (up to 63). If you think we will discover more than 63 Mersenne primes, then don't store that number here; instead, use the scheme for storing the factoring depth discussed immediately below.

If the number remains in progress, the remaining five bits of the status byte store one less than the number of bytes required to represent the length of the number. This provides huge flexibility regarding the size of eventually found factors, at the expense of some additional overhead. A zero in the bit field would mean that the factoring level (or Mersenne position) is between zero and 255 and is stored as a one-byte quantity. A one in the bit field would mean that the factor or factoring level requires between 2 and 255 bytes to represent; the first byte would contain the length and the next "n" bytes would contain the factor (most likely). A two in the bit field would mean that the factor requires between 256 and 65535 bytes to represent; the first two bytes would contain the length and the next "n" bytes would contain the factor. And so on. If the length of a factor required 32 bytes to represent, we would have a new class of problems.

I hope that this is clear and that it might be helpful to someone. I think it would help if you would please explain what it is you are trying to accomplish. 