Thursday, September 6, 2012

network server performance improvements - a real illustration (Guardian)

Here's the infrastructure. Exactly one microagent is installed on each machine to be monitored. An "environment" is defined by a base name on a machine. The relationship between microagents and environments is strictly 1:M. Suppose we have 2 microagents (on 2 machines) and 3 environments under each, so 6 distinct environments in total. A command like "dir" or "ipconfig" can execute in any one of the 6 environments, such as Environment #1. We can also run the same command on Environment #2, #3, #4, #5, or on all 6 environments. Another command, "path", can likewise hit any environment.

If we single out one microagent, one environment under it, and run one command against it, the command output would be the status of one "service". So a service is identified by a tuple of 3 things - a particular microagent, a particular environment, and a particular command. If we have 2 microagents, 3 environments under each, and 4 commands, then we could have up to 2 x 3 x 4 = 24 services. I use many different terms to refer to a service.

Sometimes I call it a query. You keep firing the same query to get updates from the microagent.
Sometimes I call it a chat room. All clients registered for that CR would get all updates.
Sometimes I call it a message generator.
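
To make the (microagent, environment, command) tuple concrete, here is a minimal Python sketch. It is only an illustration, not the actual server code; the names ServiceKey, enumerate_services, the agent and environment names, and the 4th command "hostname" are all made up for the example.

```python
from typing import NamedTuple


class ServiceKey(NamedTuple):
    """Hypothetical identifier for one service (query / chat room)."""
    microagent: str
    environment: str
    command: str


def enumerate_services(environments_by_agent, commands):
    """List every possible service: one per (agent, environment, command)."""
    return [
        ServiceKey(agent, env, cmd)
        for agent, envs in environments_by_agent.items()
        for env in envs
        for cmd in commands
    ]


# Made-up names: 2 microagents, 3 environments under each, 4 commands.
environments_by_agent = {
    "agentA": ["env1", "env2", "env3"],
    "agentB": ["env4", "env5", "env6"],
}
commands = ["dir", "ipconfig", "path", "hostname"]

services = enumerate_services(environments_by_agent, commands)
print(len(services))  # 2 * 3 * 4 = 24
```

Because the key is a plain tuple, it can be used directly as a dictionary key for whatever per-service state the server keeps (timer, subscriber list, last result).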

For each service, GUI clients continuously monitor its status. A client connects to the server to subscribe to updates. The server maintains about 100 "services", like chat rooms, and each one generates a message once every few seconds. The server pushes these messages to the registered clients, using WCF.

In terms of topology, there is just one server instance in the network, at least 3 microagent-enabled app server machines, and many, many client machines.

----
Trick: the server doesn't know when one of the registered clients goes offline, so I often notice it sending updates to 13 clients when only 1 or 0 clients are alive. I created a dictionary of connections using the client IP address as key, so we never have 2 duplicate connections to update for the same client, since one of them must be dead.
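
The dictionary idea can be sketched in a few lines of Python. This is an illustration only, assuming one client per IP address; ClientRegistry and the string "connections" are hypothetical stand-ins for the real WCF callback channels.

```python
class ClientRegistry:
    """Keeps at most one live connection per client IP address."""

    def __init__(self):
        self._by_ip = {}

    def register(self, ip, connection):
        """Store the new connection and return the stale one it replaced, if any."""
        stale = self._by_ip.get(ip)
        self._by_ip[ip] = connection
        return stale

    def connections(self):
        """Connections that should receive the next update."""
        return list(self._by_ip.values())


registry = ClientRegistry()
registry.register("10.0.0.5", "conn-1")
registry.register("10.0.0.6", "conn-2")
# The client on 10.0.0.5 reconnects; its old connection must be dead.
stale = registry.register("10.0.0.5", "conn-3")
print(stale)                   # conn-1
print(registry.connections())  # ['conn-3', 'conn-2']
```

The caller can close the returned stale connection. Note that this sketch would also evict a second legitimate client on the same machine, which is acceptable only under the one-client-per-IP assumption.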

Trick: many msg generators ("services" or "chat rooms") share the standard update interval of 60 seconds. Each is driven by a private timer. The timers start at server start time, but I decided to use different initial delays. Therefore one generator fires on the 1st sec of every minute, and another generator fires on the 2nd sec of every minute. This spreads out the load on all parties.
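
A small Python sketch of the staggering arithmetic, as an illustration only. The function names and the 1-second step are assumptions, and the offsets here are 0-based (the first generator gets delay 0).

```python
def staggered_delays(n_generators, interval_sec=60, step_sec=1):
    """Initial delay for each generator, wrapping around within one interval."""
    return [(i * step_sec) % interval_sec for i in range(n_generators)]


def fire_times(initial_delay, interval_sec, count):
    """Seconds after server start at which one generator fires."""
    return [initial_delay + k * interval_sec for k in range(count)]


delays = staggered_delays(3)
print(delays)                        # [0, 1, 2]
print(fire_times(delays[2], 60, 3))  # [2, 62, 122]
```

Each delay would be passed as the due time of that generator's private timer, with the shared 60-second period unchanged. With more than 60 generators the offsets wrap around, so a few generators share a second, which is still far better than all firing together.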

Trick: when a microagent is offline, the central server would keep hitting it as per schedule (like every 5 sec), driven by the timer. This is expensive because the thread must block until timeout. I decided to reduce the timer frequency when a microagent is seen offline. The normal frequency is restored after the microagent becomes reachable again.
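
The slow-down logic amounts to a tiny state machine. Here is a Python sketch, for illustration only; PollSchedule is a hypothetical name, and the 60-second offline interval is an assumed value (only the 5-second normal interval comes from the example above).

```python
class PollSchedule:
    """Tracks how often one microagent should be polled."""

    def __init__(self, normal_interval=5.0, offline_interval=60.0):
        self.normal_interval = normal_interval
        self.offline_interval = offline_interval  # assumed value
        self.current_interval = normal_interval

    def record_result(self, reachable):
        """Update the interval after each poll attempt and return it."""
        if reachable:
            self.current_interval = self.normal_interval
        else:
            self.current_interval = self.offline_interval
        return self.current_interval


schedule = PollSchedule()
print(schedule.record_result(reachable=False))  # 60.0, agent seen offline
print(schedule.record_result(reachable=False))  # 60.0, still offline
print(schedule.record_result(reachable=True))   # 5.0, back to normal
```

After each attempt the timer's period would be reset to the returned value, so a dead microagent costs one blocked thread per minute instead of one every 5 seconds.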

Trick: some queries on the microagent take a long time (20 sec). Before the first query completes, the 2nd query in the series could hit the same microagent, overloading both sides. I decided to set a busy flag on each query, so the next time a thread from the thread pool "wants" to fire the query, it would see the flag and simply return, without blocking.
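
Here is a Python sketch of the busy-flag idea, for illustration only. It uses a non-blocking lock acquire as the flag, so the check-and-set is atomic; GuardedQuery and slow_query are hypothetical names, and the events exist only to make the demo deterministic.

```python
import threading


class GuardedQuery:
    """Runs a query unless a previous run of the same query is still in flight."""

    def __init__(self, run_query):
        self._run_query = run_query
        self._busy = threading.Lock()  # held while the query is running

    def try_fire(self):
        """Return True if the query ran, False if it was skipped as busy."""
        if not self._busy.acquire(blocking=False):
            return False  # busy: return at once instead of queuing up
        try:
            self._run_query()
            return True
        finally:
            self._busy.release()


started = threading.Event()
release = threading.Event()


def slow_query():
    """Stand-in for a long-running query against a microagent."""
    started.set()
    release.wait(timeout=5)


query = GuardedQuery(slow_query)
worker = threading.Thread(target=query.try_fire)
worker.start()
started.wait(timeout=5)     # the first query is now in progress
skipped = query.try_fire()  # the 2nd attempt sees the flag and returns
release.set()               # let the first query finish
worker.join()
after = query.try_fire()    # the flag is clear again, so this one runs
print(skipped, after)       # False True
```

A plain boolean would also work if it is read and written under a lock; the non-blocking acquire simply folds the test and the set into one step, and the finally clause clears the flag even if the query throws.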


About Me

New York (Time Square), NY, United States
http://www.linkedin.com/in/tanbin