Tuesday, May 15, 2012

store a billion phone numbers for fast lookup

The solution I managed to recall in 2015: tree. Each node has the count as payload, and up to 10 child nodes.
---
Q: efficiently store a billion phone numbers for fast lookup.

I assume realistic usage is read-mostly rather than read-only, but this solution is reasonably efficient in read-write scenarios too. My home-grown idea happens to resemble http://en.wikipedia.org/wiki/Trie.

Assumption -- phone numbers across the world have at most 15 digits i.e. short strings. Our trie would be at most 15 levels deep. Fan-out at any parent node is at most 10.

This 15-digit assumption is not as restrictive as it may appear. The distribution of any realistic "population" of interest to an information system is often /bounded/ by some important constraints. We seldom need to store data of arbitrary length or arbitrary precision.

Compared to a naïve array data structure (converting each phone number to integer and treat it as subscript) which requires 10^15 array elements (about a million billion elements), our trie would create only a billion leaf nodes and a comparable number [1] of intermediate nodes.

[1] I'd estimate this number like this -- on 14th level, there can be only a few hundred million nodes. On the 13rd lowest, even fewer. Phone numbers share area codes, so on the 4th or 5th lowest level, we have clusters.

Search is O(1) given the Assumption above -- just follow the pointers. I feel performance is more predictable than hash table and not susceptible to bad hashing or skewed distribution.

Insert is O(1) given the Assumption above.

How about a hash table? Most standard implementations use a fill factor below 100%, so we need a bucket array of 1billion+. Memory efficient. I'm not sure about quality of hash code distribution. I feel it's possible to get many collisions.

No comments:

Total Pageviews

my favorite topics (labels)

_fuxi (302) _misLabel (13) _orig? (3) _rm (2) _vague (2) clarified (58) cpp (39) cpp_const (22) cpp_real (76) cpp/java/c# (101) cppBig4 (54) cppSmartPtr (35) cppSTL (33) cppSTL_itr (27) cppSTL_real (26) cppTemplate (28) creditMkt (14) db (65) db_sybase (43) deepUnder (31) dotnet (20) ECN (27) econ/bank` (36) fin/sys_misc (43) finGreek (34) finReal (45) finRisk (30) finTechDesign (46) finTechMisc (32) finVol (66) FixedIncom (28) fMath (7) fMathOption (33) fMathStoch (67) forex (39) gr8IV_Q (46) GTD_skill (15) GUI_event (30) inMemDB (42) intuit_math (41) intuitFinance (57) javaMisc (68) javaServerSide (13) lambda/delegate (22) marketData (28) math (10) mathStat (55) memIssue (8) memMgmt (66) metaProgram` (6) OO_Design (84) original_content (749) polymorphic/vptr (40) productive (21) ptr/ref (48) py (28) reflect (8) script`/unix (82) socket/stream (39) subquery/join (30) subvert (13) swing/wpf (9) sysProgram` (16) thread (164) thread_CAS (15) thread_cpp (28) Thread* (22) timeSaver (80) transactional (23) tune (24) tuneDB (40) tuneLatency (30) z_ajax (9) z_algoDataStruct (41) z_arch (26) z_arch_job (27) z_automateTest (17) z_autoTrad` (19) z_bestPractice (39) z_bold (83) z_bondMath (35) z_book (18) z_boost (19) z_byRef^Val (32) z_c#GUI (43) z_c#misc (80) z_cast/convert (28) z_container (67) z_cStr/arr (39) z_Favorite* (8) z_FIX (15) z_forex (48) z_fwd_Deal (18) z_gz=job (33) z_gzBig20 (13) z_gzMgr (13) z_gzPain (20) z_gzThreat (19) z_hib (19) z_IDE (52) z_ikm (5) z_IR_misc (36) z_IRS (26) z_javaWeb (28) z_jdbc (10) z_jobFinTech (46) z_jobHunt (20) z_jobRealXp (10) z_jobStrength (15) z_jobUS^asia (27) z_letter (42) z_linq (10) z_memberHid` (11) z_MOM (54) z_nestedClass (5) z_oq (24) z_PCP (12) z_pearl (1) z_php (20) z_prodSupport (7) z_py (31) z_quant (14) z_regex (8) z_rv (38) z_skillist (48) z_slic`Problem (6) z_SOA (14) z_spring (25) z_src_code (8) z_swingMisc (50) z_swingTable (26) z_unpublish (2) z_VBA/Excel (8) z_windoz (17) z_wpfCommand (9)

About Me

New York (Time Square), NY, United States
http://www.linkedin.com/in/tanbin