Archive for the ‘code or systems’ Category

Shell trick: track new tcp connections per second in Linux

Thursday, February 11th, 2010

This little snippet is for when you want to see new active (outbound) connections opened per second, rather than the count of concurrently established connections that most tools show you:


C=0; while true; do echo "new connections: $C"; c1=$(netstat -s -t | awk '/active connections? openings/ {print $1}'); sleep 1; c2=$(netstat -s -t | awk '/active connections? openings/ {print $1}'); C=$((c2 - c1)); done
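
The same counter can be read without spawning netstat. Below is a minimal sketch in Python, assuming a Linux host where /proc/net/snmp exposes the Tcp ActiveOpens field (the kernel counter behind netstat's active openings line). The function names active_opens and watch are made up for this illustration.

```python
import time


def active_opens(snmp_text):
    """Return the Tcp ActiveOpens counter from /proc/net/snmp-style text."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    # The first "Tcp:" line holds the field names, the second holds the values.
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("ActiveOpens")])


def watch(path="/proc/net/snmp", interval=1.0):
    """Print the number of new outbound connections once per interval."""
    with open(path) as f:
        previous = active_opens(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            current = active_opens(f.read())
        print("new connections:", current - previous)
        previous = current
```

Calling watch() loops forever, like the one-liner; stop it with Ctrl-C. Because it keeps the previous reading between iterations, no connections fall into a gap between two samples.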

Enjoy!

Ruby for are these chars in this string…

Thursday, February 11th, 2010

I find myself wanting to know if the characters in one string are in another.
For example, if you have a key space for a string key name, making sure the provided key is valid.


require 'set'
def chars_subset_of?(checkme, inme)
  checkme.split(//).to_set.subset?(inme.split(//).to_set)
end

You could also toss this inside class String to add the method to every string:


require 'set'
class String
  def chars_subset_of?(other_string)
    if other_string.is_a?(String)
      split(//).to_set.subset?(other_string.split(//).to_set)
    else # just try: anything that responds to to_set
      split(//).to_set.subset?(other_string.to_set)
    end
  end
end

Appengine: Accumulate and Rejoin Fragmented Data or Buffer Small Object Floods in Memcache

Thursday, February 11th, 2010

I put this together to solve a problem where contiguous data generated in JavaScript on the browser side needed to be broken into pieces and sent to the server and reconstructed there. The same tool could be used to buffer rapidly incoming small objects to queue for batch inserts for higher performance / less contention on the Appengine datastore.

For the “large object out of many small fragments” use case:

The principle is the sender provides some kind of identifier that uniquely identifies the batch. I use a cookie that is generated on page load combined with a counter in the JS that is incremented once for each batch from that client. The cookie was conveniently already there for another purpose.

On the server, the library takes this identifier and uses it as the memcache root key, or “groupkey” in the code. Each fragment sent should include the total number of fragments expected and its own index in the array of fragments. An atomic global counter on the server keeps track of how many pieces have arrived. The API lets you use the library to preserve the array order, or you can put order info in the stored value and sort it out yourself when the server gives you the completed array of accumulated fragments. Put simply, the server uses the unique identifier both to store the fragments in memcache and to track the fragment count; when the count equals the expected total, the library raises a Complete exception, which lets you fetch back an array of all the fragments. Typically this happens inline with the request that sends the last missing fragment.
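
The flow described above can be sketched roughly as follows. This is not the library's code: FakeMemcache is an in-process stand-in for a memcache client, and add_fragment, fetch_fragments, and the key layout are names invented for the illustration. Only the groupkey idea and the Complete exception come from the description.

```python
import threading


class Complete(Exception):
    """Raised when the last expected fragment of a batch has been stored."""


class FakeMemcache:
    """In-process stand-in for a memcache client (assumption, illustration only)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def incr(self, key):
        # Atomic increment that treats a missing key as 0.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + 1
            return self._data[key]


def add_fragment(cache, groupkey, index, total, value):
    """Store one fragment; raise Complete once `total` fragments are stored."""
    cache.set("%s:frag:%d" % (groupkey, index), value)
    if cache.incr("%s:count" % groupkey) == total:
        raise Complete(groupkey)


def fetch_fragments(cache, groupkey, total):
    """Return the stored fragments in index order."""
    return [cache.get("%s:frag:%d" % (groupkey, i)) for i in range(total)]


# Fragments may arrive out of order; the index restores the original order.
cache = FakeMemcache()
try:
    for index, piece in [(2, "c"), (0, "a"), (1, "b")]:
        add_fragment(cache, "cookie42-7", index, 3, piece)
except Complete:
    data = "".join(fetch_fragments(cache, "cookie42-7", 3))
```

Note that this sketch counts every call, so a fragment that is resent would be counted twice; a real implementation would need to guard against that.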

For the case of queueing up small objects for batch insert into BigTable:

This use case is obvious, and analysis is left to the reader as an exercise (smile). Basically, the above concern about order is removed, and the reconstruction is unneeded.
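
For what it's worth, here is one way the buffering idea might look. It is a sketch under assumptions: a plain list stands in for the memcache-backed buffer, and BatchBuffer, its flush callback, and batch_size are hypothetical names, not part of the library.

```python
class BatchBuffer:
    """Collects small objects and hands them to `flush` in batches."""

    def __init__(self, flush, batch_size=50):
        self._flush = flush            # callable taking a list, e.g. a batch insert
        self._batch_size = batch_size
        self._pending = []             # stand-in for the memcache-backed buffer

    def add(self, obj):
        """Queue one object; flush automatically when the batch is full."""
        self._pending.append(obj)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Hand over whatever is pending, if anything."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)


# Seven objects with a batch size of three yield two full batches,
# plus one leftover object that needs an explicit final flush.
batches = []
buffer = BatchBuffer(batches.append, batch_size=3)
for n in range(7):
    buffer.add(n)
buffer.flush()
```

Order within a batch does not matter here, so no index or expected total needs to be sent with each object.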


Python DB Logging Handler for Google Appengine

Thursday, February 11th, 2010

NOTE: this is really only for CRITICAL logging you want to persist separately from Google’s wrapper around logging. You will burn up your quota otherwise. Main advantages: implements a logging handler, maintains support for identities. Good for audit trails.

UPDATE: It appears the module-level variable magic is not liked by appengine-patch. I switched to that from appengine helper because of the hybrid auth convenience. I have removed that part to prevent any module-level variable weirdness. Updated code below. Alternative: just stick it inside your django tree.
— end UPDATE

I provide here a module that extends the Python logging framework to allow you to write messages to your Google appengine database. A database LogRecord model, Handler, Formatter, and logger manager are implemented.

Since you can’t write to files on Appengine, you have no easy way to separate your logging into different file handlers. The built-in logging is nifty in the dev cycle, but a long-term store is desirable, with the ability to set different log levels for different parts of your system.

So the goals I set out with were:

  • Persist all the usual log record info, including file, line, module, stack, etc. as available
  • Allow multiple logger identities for easy separation and grouping
  • Don’t break anything about logging module’s expected behavior
  • Provide a convenience wrapper for zero-effort use

Now, in principle I’d generally avoid logging to a database, since log records are inherently denormalized, sequentially added, and typically read only after insert. They’re the perfect candidate for a flat file. But that is not an option here. Plus, I have confidence that the Appengine table structure should allow this to perform more or less like a flat file. Caveat: depending how you use this, you could hit your quotas substantially faster.

Quick note: this Handler provides its own formatter, which just shapes the data for the Appengine table. It doesn’t make sense to give it a format string, since it is not serializing the record. Similarly, I didn’t think it useful to trim off the lineno and funcname detail, since you can just select what you want from the table.
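
To illustrate the idea of a handler that shapes a record into a row rather than formatting it to a string, here is a minimal sketch using only the standard logging module. It is not the Log2DB code: ListHandler is a hypothetical name, the field names are guesses, and a plain list stands in for the Appengine table.

```python
import logging


class ListHandler(logging.Handler):
    """Turns each log record into a dict; the list stands in for a db table."""

    def __init__(self, identity, store):
        logging.Handler.__init__(self)
        self.identity = identity
        self.store = store

    def emit(self, record):
        # Shape the record into a row instead of serializing it to text.
        self.store.append({
            "identity": self.identity,
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "funcname": record.funcName,
            "lineno": record.lineno,
        })


rows = []
logger = logging.getLogger("sketch.MyMailLogger")
logger.propagate = False          # keep the example out of the root logger
logger.setLevel(logging.DEBUG)
logger.addHandler(ListHandler("MyMailLogger", rows))
logger.error("oh noes: %s", "smtp down")
```

After the last line, rows holds one dict per message, which is the shape a datastore model would be filled from.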

Otherwise, you use it just like any other logger from the logging module. When you actually write a message, a db entry is created for you. The default logger is created when getLogger() is called with no arguments. You can interact with the logger to set levels etc. via the object, as in
log = Log2DB.getLogger()
log.setLevel(logging.DEBUG)

Records created by this logger set the identity value as ‘Log2DB’. You can filter identities from the log table by setting identity = whatever you want.

For example:

import Log2DB
log = Log2DB.getLogger()
log.error('fun times')

import logging
log.logger.setLevel(logging.DEBUG)
log.debug('unfun detail')

All the above go to the default logger identity, “Log2DB”.
To create another logger to use, say for example you want one just for mailer status, you use the getLogger proxy function.

For example:

import Log2DB
maillogger = Log2DB.getLogger('MyMailLogger')
maillogger.error('oh noes')

class HellException(Exception):  # placeholder exception for the example
  pass

try:
  raise HellException
except HellException as e:
  maillogger.exception('Error: %s stack to follow', e)

A record from the above is in the same table as the default log, but has identity = ‘MyMailLogger’. I thought about actually dynamically changing the log record model class name so you would get separate tables, but given the google architecture this has limited perf gain at cost of brittleness and complexity. I may add this anyway as an option.

Later I may paste a django view for managing the resulting records.

here is the code via pastie.org:

this replaces the old version at http://pastie.org/432156
moved _LogRecordHelper to inner class to fix an issue

Extend Rails ActiveRecord and ConnectionAdapter to support dirty reads on MySQL

Thursday, February 11th, 2010

Discussed here is a mixin that extends Rails to provide an easy method to switch your database session between clean and dirty reads, otherwise known as transaction isolation level.

At Rescuetime, our most interesting analysis of tracked time requires a lot of complicated data munging on the database side. We extensively optimize the structure of the datastore and the plan of queries to produce quick results. However, some amount of row scanning and index hunting, however small, is inevitable. Some of these tables are also subject to rapid, simultaneous insert loads on overlapping rows.

In general, reporting or analytical access has no worries about data being up to date to the nearest microsecond, although as near real time as possible is highly desired. This near real time goal rules out an ETL type solution. Additionally, there is the cost factor. If we can make this work on one database, why build two?

In the quest for minimal stress on the online system, we introduced a method for flagging database work as dirty reads, thus preventing any kind of locking (especially index locking) on the rows in question, and applied it where possible. This is simple enough in straight SQL, but we wanted to expose it to the Rails framework in a consistent manner.

What this code does is:

1) Provide stubs in the abstract database adapter for “dirty()”, “clean()”, “reallyclean()”
2) Implement them in the MySQL adapter
3) Expose them to ActiveRecord::Base as a class method, prefixed with “isolation_”

On #1 and #2: the choice of “reallyclean” versus “clean” simply describes the read locking strategy used. E.g., if you have multiple selects in the same transaction on the same rows, “clean” returns the same result from the same snapshot on every select, whereas “reallyclean” returns newer committed rows on the later selects if they exist.

On #3: the prefix “isolation_” is added (yielding “isolation_dirty()” etc.) since “dirty” already has a meaning in ActiveRecord.

For MySQL we set:

dirty = READ UNCOMMITTED
clean = REPEATABLE READ
reallyclean = READ COMMITTED

See the MySQL reference on transaction isolation levels.
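
In plain SQL terms, the mapping above comes down to one statement per mode. The following is a language-neutral sketch written in Python, not the Rails mixin itself; isolation_sql and ISOLATION_LEVELS are hypothetical names, and the use of SESSION scope is an assumption based on the goal of switching the database session.

```python
# Mode names from the post, mapped to MySQL isolation levels.
ISOLATION_LEVELS = {
    "dirty": "READ UNCOMMITTED",
    "clean": "REPEATABLE READ",
    "reallyclean": "READ COMMITTED",
}


def isolation_sql(mode):
    """Return the MySQL statement that switches the current session to `mode`."""
    return "SET SESSION TRANSACTION ISOLATION LEVEL " + ISOLATION_LEVELS[mode]


statement = isolation_sql("dirty")
```

An adapter method would send such a statement over the connection before the queries it should affect, and send the "clean" statement afterwards to restore the default.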

Here is the code:


Just put this code in something like lib/mysql_adapter_extensions.rb in your project, then require that in some controller.

Example use:

Person.isolation_dirty()
result = Person.find_by_name params[:name] # some crazier query here
Person.isolation_clean()