Sunday, September 9, 2007

listing customers with 0 payment history

listing customers with 0 payment history (according to the Payment table) -- A simple anti-join, well covered in literature.

Customer_table (cid...) with primary key on cid
Payment_table (cid, date, amount ..) with non-unique index on foreign-key cid

S1 (uncorrelated): select C.cid
from C
where C.cid not in
(select distinct cid from P)

S2 (correlated) -- thanks to index, no FTS:
select C.cid
from C
where not exists
(select * from P
where C.cid = P.cid)

[1] One of the first key observations is the subset relationship between the prikey C.cid and forkey P.cid, and consequently the referential integrity.

[2] Second key observation is that P.cid is (perhaps heavily) repetitive, if customers pay a few times a day (for a subway pass). I think it's imperative to use the index on P.cid.

[3] Third key observation is the absense of additional criteria => have to process every C.cid. In most real life queries, another where-criterion restricts us to a fraction of the C.cid values. In that case P.cid FTS is bad.

S1 subquery runs only once and reads every Payment_table page exactly once. S2 subquery runs once for each [4] row in Customer_table, reads every index entry but never reads the Payment_table. Due to caching, S2 loads every index data block exactly once.

[4] I think the repetitive run of the subquery is not as bad as unnecessary disk reads. Number of disk reads dominates SELECT performance. What's the disk vs memory speed ratio? Orders of magnitude! S1 reads all P.cid, S2 loads the entire index on P.cid. Which reads less? Also see http://www.onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html.

In any case, the correlated S2 always reads every distinct P.cid that matches C.cid, which is every[1] distinct P.cid /value/, but I think S2 reads only index, not the table. In most real-world queries, a index scan precedes a table read.

It is possible that FTS on P requires almost the same number of disk reads as a full index scan, if P.cid is close to unique. Without [2], disk reads related to P are identical between S1 and S2.


S3 (using distinct): -- no FTS on P, probably beats the anti-join recommendation in [[ oracle sql tuning ]].
select distinct C.cid
from C outer join P on cid
where P.cid = null

S4 (using count): select C.cid
from C outer join P on cid
having count(date) = 0
group by C.cid

S5 (using sum): select C.cid
from C outer join P on cid
having sum(amount) = NULL -- not 0. P 93 [[ intro to sql ]]
group by C.cid

S6 (using minus):

No comments:

Total Pageviews

my favorite topics (labels)

_fuxi (302) _misLabel (13) _orig? (3) _rm (2) _vague (2) clarified (58) cpp (39) cpp_const (22) cpp_real (76) cpp/java/c# (101) cppBig4 (54) cppSmartPtr (35) cppSTL (33) cppSTL_itr (27) cppSTL_real (26) cppTemplate (28) creditMkt (14) db (65) db_sybase (43) deepUnder (31) dotnet (20) ECN (27) econ/bank` (36) fin/sys_misc (43) finGreek (34) finReal (45) finRisk (30) finTechDesign (46) finTechMisc (32) finVol (66) FixedIncom (28) fMath (7) fMathOption (33) fMathStoch (67) forex (39) gr8IV_Q (46) GTD_skill (15) GUI_event (30) inMemDB (42) intuit_math (41) intuitFinance (57) javaMisc (68) javaServerSide (13) lambda/delegate (22) marketData (28) math (10) mathStat (55) memIssue (8) memMgmt (66) metaProgram` (6) OO_Design (84) original_content (749) polymorphic/vptr (40) productive (21) ptr/ref (48) py (28) reflect (8) script`/unix (82) socket/stream (39) subquery/join (30) subvert (13) swing/wpf (9) sysProgram` (16) thread (164) thread_CAS (15) thread_cpp (28) Thread* (22) timeSaver (80) transactional (23) tune (24) tuneDB (40) tuneLatency (30) z_ajax (9) z_algoDataStruct (41) z_arch (26) z_arch_job (27) z_automateTest (17) z_autoTrad` (19) z_bestPractice (39) z_bold (83) z_bondMath (35) z_book (18) z_boost (19) z_byRef^Val (32) z_c#GUI (43) z_c#misc (80) z_cast/convert (28) z_container (67) z_cStr/arr (39) z_Favorite* (8) z_FIX (15) z_forex (48) z_fwd_Deal (18) z_gz=job (33) z_gzBig20 (13) z_gzMgr (13) z_gzPain (20) z_gzThreat (19) z_hib (19) z_IDE (52) z_ikm (5) z_IR_misc (36) z_IRS (26) z_javaWeb (28) z_jdbc (10) z_jobFinTech (46) z_jobHunt (20) z_jobRealXp (10) z_jobStrength (15) z_jobUS^asia (27) z_letter (42) z_linq (10) z_memberHid` (11) z_MOM (54) z_nestedClass (5) z_oq (24) z_PCP (12) z_pearl (1) z_php (20) z_prodSupport (7) z_py (31) z_quant (14) z_regex (8) z_rv (38) z_skillist (48) z_slic`Problem (6) z_SOA (14) z_spring (25) z_src_code (8) z_swingMisc (50) z_swingTable (26) z_unpublish (2) z_VBA/Excel (8) z_windoz (17) z_wpfCommand (9)

About Me

New York (Time Square), NY, United States
http://www.linkedin.com/in/tanbin