« September 2011 | Main | November 2011 »

October 22, 2011

On Writing

On WritingOn Writing by Stephen King

My rating: 4 of 5 stars


I found this an intriguing book but didn't absorb as much of the advice as I should of. The biographical bookends offer a peak into why Stephen King writes the type of novels he does along with his harrowing tale of being hit by a car. Make no mistake the author has strong opinions about what works in writing and how to become a writer. He firmly believes that most will never become great writers but that it possible to go from being a competent writer to a good writer by following the sage advice he lays out in the heart of the book. As someone that doesn't have plans to become a professional writer if I'm able to take away one or two tidbits I know that will help.



View all my reviews

Tags: books

October 16, 2011

Mongo Boston 2011

Below are the notes that I took during the Oct 3, 2011 Mongo Boston Conference. If you notice any mistakes in my notes from what the excellent speakers said please let me know.

Welcome and What's New in MongoDB 2.0

Key objectives with MongoDB


  • JSON

  • General purpose

  • Scaling

  • Few knobs


What's new in 2.0


  • Journaling (on by default for 64-bit) (getLastError enhanced with flag to ACK it was written to journal)

  • Compact can now be run in parallel

  • Replica sets now support tags (useful for data center awareness in queries and replication)

  • Improved concurrency many locks are now yielded

  • Smaller faster indexes


Morphia: Easy Java Persistence

Morphia

Built on top of the 10Gen java driver.
Type-safe.
Uses JPA like annotations (@Entity, @Id, @Reference, @Serialized, @Transient, @Property, @Index, @Indexed, @NotSaved).
Speaker made heavy use of lombok library for automatic get/set and toString method handling.
Can handle ObjectId (Monog default) or custom Ids on objects.
Fluint like query interface.
Hooks for (@PrePersist, @PreSave, @PostPersist, @PreLoad, @PostLoad, @EntityListeners).
Does add className field to every saved object.
Supports loading a document into a different class and won't blow up on missing fields.
Can lazy load via Ref or Key (better to use).

Journal and Storage Engine

Separate file per DB, uses pre-allocation so there is always a spare file, largest file is 2GB.
Can get details of extends a collection is using through flags to validate.
Uses memory mapped files this means virtual size of MongoDB process is data size + overhead. With journaling it is overhead + 2 x data-size.
By default journaling fsyncs every 60seconds can tune with --syncDelay, doesn't nothing if there is nothing to flush.
If you have lots of inserts better to sync more frequently to reduce the chance that the OS force flushes.
db.serverStatus() includes fsync behavior
Future version of MongoDB should add support for data and indexes being stored in separate files (i.e. ability to put on different spindles/SSD)
Best to put journaling files on separate spindle.
Use ext4 for xfs for file system (zfs possible but not 100% recommended yet)
Journal on same spindle as data/index leads to 5-30% slowdown, while only 3% when on a separate spindle, this is very true when running on EC2/EBS
Things to watch out for:


  • transaction log wrapping due to large data transactions (can tune)

  • file fragmentation due to document size changes
    Changes to free list management coming to a future version of MongoDB which should help tremendously.
    Compact process is to run on secondaries and then step down the primary


Schema Design at Scale

Three keys things to consider:


  • embedding

  • indexes

  • sharding


Embedding (used example of a blog post with comments)


  • Embedded documents great for reads and reducing round-trips

  • Non-Embedded great for writes

  • Hybrid non-embedded with bucket structure using $push, $inc, and $pull

  • Hybrid model leads to variable size of items in buckets due to race conditions and deletes (not horrible to work around)

  • Can speed things up by having first bucket embedded in document

  • Use semantic key as id whenever possible

  • GUID for bucket objects since array ordering will vary


Indexing


  • Want right balanced tree insertion (using auto increment, or time based)

  • Covered indexes rock (given index for A,B, don't need index for A, queries that just return A+B only access index)

  • Always specify fields to return and must explicitly exclude _id if not needed

  • Explain is your friend, log bugs in JIRA if an index isn't being used


Sharding


  • Chunks try to be 64MB ro 100,000 Objects

  • Sharding key is hard to change but ranges can more easily change

  • Shard key options:

    • Auto incrementing: leads to hot spots

    • MD5(data): good r/w, bad index working set size

    • yearmonth + md5(data): leads to good distribution and good indexes




High-throughput data analysis

Worked on project that was processing on average 5GB/hour for 24 hours
Build using Lean startup approach: Idea -> Build, Code -> Measure, Data -> Learn

Listener


  • Written in C

  • Data sources stream updates to listener that buckets data into 1 second chunks

  • Used one socket per data type (12 in total)

  • Calculates delta between events before written into Mongo

  • Writes to different collection to indicate latest chunk written


Map Reduce


  • Using Mongo's Map Reduce

  • Map Reduce runs every second

  • Process each chunk in parallel


Replication and Replica Sets

Can configure a slave to get a delayed ops log (help against fat finger issues)
Manage through: rs.initiate, rs.add, rs.remove, rs.reconfigure, rs.stepDown
Op log application is idempotent
Currently limited to 12 members of a replica set (of which 7 exchange heartbeats and 5 non-voting members)
Heartbeats fire between all replica set members apx. every 200ms which means .5 - 2 sec lag if new primary is needed
Priority determines chance to be elected, however secondary with newest op log has precedence
2.0 adds tags which can be used with getLastError() to ensure things like cross datacenter replication (given tags like dc.ny, dc.sf)
Tag rules can be fairly complex
Smarter geographically distributed replication coming in the future

Optimizing MongoDB: Lessons Learned at Localytics

Tips:


  • Use short names for fields (recommend single letter)

  • Store GUIDS/long IDs with BinData()

  • Overrride _id with semantic key

  • Pre-aggregate with $inc

  • Prefix indexes with suffix buckets

  • Sparse indexes define with sparse: true with create

  • Keep index working set in RAM


Indexes:


  • Use indexes

  • Only query for what you want, use limit() and findOne()

  • Specify fields to return

  • Covering indexes are awesome

  • Exclude _id if you don't need it or is not part of index

  • To avoid lock wait time (not as true in 2.0) do a fetch then update


Fragmentation:


  • Setup inserts so they are right balanced into indexes

  • Manual padding if you know the document will expand in the future (BinData)

  • Beware that compact will remove padding


Performance:


  • Hash leads to all B-Tree access

  • ObjectId has temporal prefix

  • Can turn off automatic balancer and do manual chunk assignment

  • In EC2 running on an m1 or m2 will most likely give you an isolated box

  • Using m2 for high memory mongods

  • Using micro for arbiters and configs

  • Using EBS and RAID0 for writes and RAID10 for reads

  • Use at least 4-8 EBS volumes, 8-16 recommended

  • Determined by testing with a 10:1 read:write ratio with a working set size larger than RAM

  • Need shard key in every key


Text Analysis Using MongoDB

Creating features from unstructured data and storing features in Mongo
Running Hadoop process on same machine as MongoDB
Map/Reduce written in MongoReduce
Using Thrift
Have low-latency process running on same machine as MongoDB
Use Mongo configuration to determine what servers to talk to (single source of truth)

The Mathematics of Life

The Mathematics of LifeThe Mathematics of Life by Ian Stewart

My rating: 3 of 5 stars


The book explores the author's premise that mathematics is the sixth revolution to impact biology following those of the microscope, classification, evolution, genetics, and DNA's structure. The first third of the book examines the history of each of those revolutions and the impact on biology. The remainder of the book consists of vignettes about the interplay between mathematics and biology. The breadth of material exposed me to fascinating tidbits about animal patterns, evolutionary niches, and general biology. Alas the chapters are loosely coupled and the drive to prove that mathematics is the sixth revolution for biology is barely mentioned at all. In fact the book concludes that "I doubt that mathematics will ever dominate biological thinking in the way it now does for physics, but its role is becoming essential." While I found the book enjoyable to read it's lack of depth or cohesiveness left me wanting more.



View all my reviews

Some notes from my reading:


  • Prokaryotes reproduce by splitting into two copies: this process is called binary fission. Eukaryotes also split into two copies, but because such cells are more complex, their division is also more complex. Additionally, eukaryotes are usually capable of sexual reproduction, ... [84]

  • The relationship between genes and organisms is a feedback loop: genes affect organisms via development; organisms affect genes (in the next generation) via natural selection. [103]

  • Science is seldom about direct observation: it is nearly always about indirect inference. [221]

  • But sometimes [evolution] levers them apart, because several distinct survival strategies can exploit the environment more effectively than one. [238]

  • Now the program can replicate the robot when the robot obeys the instructions, and the robot can replicate the program by copying it but not obeying the instructions. [282]

  • This is one of several ways in which 'deterministic' and 'predictable' differ in practice, despite being essentially the same in principle. [285]

  • Cohen distinguished these two types of feature, calling them universals [likely to be found in alien life forms] and parochials [merely accidents of evolution]. ... Five digits on a hand is a parochial, but appendages that can manipulate objects are universal. [295]

Tags: books mos