Saturday, February 5, 2011

XR's memory efficiency in large in-memory search (DB diff via xml

A real world Problem: 2 DB tables have very similar content. The owner decided to periodically export each table into a 200MB xml file. Each row becomes an element in xml. Each field an attribute. Business Users want to compare the rows on-demand via a servlet. Need to show per-field differences.

Performance? A few seconds to compare the xml files and show result.

Due to various overhead while reading and write files, -Xmx1000m is required.

Insight -- duplicate object is the chief performance issue. Key technique is an object registry (like SSN in US), where each recurring object (strings, Floats too!) is assigned an 32-bit Oid. Dates are converted to dates then long integer (memory efficient).  To compare 2 corresponding fields, just compare their Oid. String comparison is much slower than int comparison.

Use an AtomicInteger as Oid auto-increment generator -- thread-safe.

Solution:
- load each XML into memory, but never to instantiate 2 identical objects. Use the Oid.
** construct an {Object,Oid} hashmap?
- now build object graphs, using the Oid's. Each row is represented by an int array
- 1m rows become 1m int arrays, where 0th array element is always Column 0, 1st element always Column 1 ...
- Since Column 3 is primary key, we build a hashmap (primary key Oid, int array) using 1st xml file
- now we go through the 1m rows from 2nd xml file. For each row, we construct its int array, use the primary key's Oid to locate the corresponding int array from the hashmap, and compare the 2 int arrays.

Note xml contains many partial rows -- some of the 50 fields might be null. However, every field has a name. therefore our int-array is always 50 but with nulls represented by -1.

Table has 50 columns, 400,000 rows. Each field is char(20) to char(50).

No comments:

Total Pageviews

my favorite topics (labels)

_fuxi (302) _misLabel (13) _orig? (3) _rm (2) _vague (2) clarified (58) cpp (39) cpp_const (22) cpp_real (76) cpp/java/c# (101) cppBig4 (54) cppSmartPtr (35) cppSTL (33) cppSTL_itr (27) cppSTL_real (26) cppTemplate (28) creditMkt (14) db (65) db_sybase (43) deepUnder (31) dotnet (20) ECN (27) econ/bank` (36) fin/sys_misc (43) finGreek (34) finReal (45) finRisk (30) finTechDesign (46) finTechMisc (32) finVol (66) FixedIncom (28) fMath (7) fMathOption (33) fMathStoch (67) forex (39) gr8IV_Q (46) GTD_skill (15) GUI_event (30) inMemDB (42) intuit_math (41) intuitFinance (57) javaMisc (68) javaServerSide (13) lambda/delegate (22) marketData (28) math (10) mathStat (55) memIssue (8) memMgmt (66) metaProgram` (6) OO_Design (84) original_content (749) polymorphic/vptr (40) productive (21) ptr/ref (48) py (28) reflect (8) script`/unix (82) socket/stream (39) subquery/join (30) subvert (13) swing/wpf (9) sysProgram` (16) thread (164) thread_CAS (15) thread_cpp (28) Thread* (22) timeSaver (80) transactional (23) tune (24) tuneDB (40) tuneLatency (30) z_ajax (9) z_algoDataStruct (41) z_arch (26) z_arch_job (27) z_automateTest (17) z_autoTrad` (19) z_bestPractice (39) z_bold (83) z_bondMath (35) z_book (18) z_boost (19) z_byRef^Val (32) z_c#GUI (43) z_c#misc (80) z_cast/convert (28) z_container (67) z_cStr/arr (39) z_Favorite* (8) z_FIX (15) z_forex (48) z_fwd_Deal (18) z_gz=job (33) z_gzBig20 (13) z_gzMgr (13) z_gzPain (20) z_gzThreat (19) z_hib (19) z_IDE (52) z_ikm (5) z_IR_misc (36) z_IRS (26) z_javaWeb (28) z_jdbc (10) z_jobFinTech (46) z_jobHunt (20) z_jobRealXp (10) z_jobStrength (15) z_jobUS^asia (27) z_letter (42) z_linq (10) z_memberHid` (11) z_MOM (54) z_nestedClass (5) z_oq (24) z_PCP (12) z_pearl (1) z_php (20) z_prodSupport (7) z_py (31) z_quant (14) z_regex (8) z_rv (38) z_skillist (48) z_slic`Problem (6) z_SOA (14) z_spring (25) z_src_code (8) z_swingMisc (50) z_swingTable (26) z_unpublish (2) z_VBA/Excel (8) z_windoz (17) z_wpfCommand (9)

About Me

New York (Time Square), NY, United States
http://www.linkedin.com/in/tanbin