Internal:Scrum:Planning 20111129

From iDigBio
Revision as of 17:14, 13 December 2011 by Mcollins

Planning Meeting 11/29/2011

Diag.jpg

Problems

Problems with Storage

  • Full text search at scale - use Riak?
  • File system - cost, scale, access via web -> use object store, swift
  • Efficient text store - divide data into text and objects
  • Too big/unresilient for 1 location
  • Federation & replication is hard - Swift sort of has replication, Riak does but $$
  • "Backup" and "Archive"
  • Mapping many:many between imgs + specimens
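
A minimal sketch of the many:many mapping between images and specimens, assuming both are referred to by opaque string IDs. The class name, method names and IDs below are hypothetical, and a real store (Riak links, a join table, etc.) would replace the in-memory dicts.

```python
from collections import defaultdict


class ImageSpecimenIndex:
    """Bidirectional many-to-many index between image IDs and specimen IDs."""

    def __init__(self):
        self._specimens_by_image = defaultdict(set)
        self._images_by_specimen = defaultdict(set)

    def link(self, image_id, specimen_id):
        # Record the relationship in both directions so either side can be queried.
        self._specimens_by_image[image_id].add(specimen_id)
        self._images_by_specimen[specimen_id].add(image_id)

    def specimens_for(self, image_id):
        return sorted(self._specimens_by_image.get(image_id, ()))

    def images_for(self, specimen_id):
        return sorted(self._images_by_specimen.get(specimen_id, ()))


# Hypothetical IDs, for illustration only.
index = ImageSpecimenIndex()
index.link("img-1", "spec-A")
index.link("img-1", "spec-B")  # one image showing two specimens
index.link("img-2", "spec-A")  # a second image of the same specimen
```

Storing both directions doubles the writes but keeps "all images of a specimen" and "all specimens in an image" equally cheap to look up.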

Problems with Local Data Processing

  • Iteration / MapReduce performance (Riak may support natively)
    • API programming ease
    • File system support
  • Access control / Metering / Monitoring & Policy
  • Appliance vs service vs vm
  • Port existing tools to run on our system
  • Download results vs update iDigBio
  • Image processing

Problems with Data Exposure

  • Large requests eg results for "US" - metering, rate limiting
  • Formats - JSON, XML, CSV - and hierarchical data
  • Programmatic access efficiency / latency, for r in set do
  • API bindings for used languages
  • Usage tracking
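
A sketch of two of the items above, under stated assumptions: a token bucket is one common way to do rate limiting (nothing here was decided in the meeting), and grouping IDs into batches is one way to cut the per-record latency of a "for r in set do" loop. TokenBucket, in_batches and all parameters are hypothetical names.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False


def in_batches(ids, size):
    """Yield lists of at most `size` IDs, so a client makes one request per batch."""
    batch = []
    for item in ids:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

A large request such as all results for "US" could be charged a higher `cost` than a single-record lookup, which ties metering to the same mechanism.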

Problems with Portal

  • Visualization depends on - geolocation - base mapping layers
  • Full text/faceted search performance
  • Taxon matching needs high quality name resolution service
  • Comparison to existing portals
  • Web design quality
  • Typical software feature requests and bugs from users - bugs -> internal redmine (poor auth integration)
  • Feedback, usage tracking

Problems with Peers and Partners

  • How much data do peers get?
  • Sharding and reassembly between peers - force full copy of metadata, shard objects - best to have one place with all images but allow peers to mirror sets
  • Replication protocols - OAI-PMH
    • multimaster updates
    • data provenance and residency tracking
  • Peer training and technology skills to run our stack - or packaging for simplicity
  • Usage tracking of remote data access
  • Peer storage of object versions

Problems with Ingestion

(bad data)

  • Field mapping - standardize on Darwin Core, Audubon Core, etc
  • Taxa Name -> LSID?
  • Georeferencing, provided, data importation, quality check
  • Outlier detection/correction
  • Staging/preview area assist with above
  • Whole data set versus updates -> frequency of updates, DWC archive, TAPIR
  • Human-in-loop vs Bulk
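
A minimal sketch of a georeferencing quality check that a staging/preview area might run, assuming records are dicts keyed by the Darwin Core terms decimalLatitude and decimalLongitude. The function name and the specific checks are illustrative assumptions, not an agreed rule set.

```python
def check_coordinates(record):
    """Return a list of problem descriptions for one record (empty list = passed)."""
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return ["missing or non-numeric coordinates"]

    problems = []
    if not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if not -180 <= lon <= 180:
        problems.append("longitude out of range")
    if lat == 0 and lon == 0:
        # 0,0 is a common placeholder for "unknown" in bulk exports.
        problems.append("coordinates are exactly 0,0 (likely a placeholder)")
    return problems
```

Returning descriptions instead of raising lets a bulk load collect every flagged record for human review rather than stopping at the first bad row.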

(good data)

  • Specimen / Occurrence / Image ID [Local] -> GUID/URI (assign LSID range to each provider) [Global]
  • Provenance Tracking - Collection, TCN, Uploader, Residency
  • Versioning - overwrite vs append
  • Required field set - GBIF minimum, not images
  • Accepted protocols - TAPIR, DWC Arc, OAI-PMH, native to app, CSV, XLS, SQL
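
A sketch of the local -> global ID step, assuming each provider is given its own namespace and that identifiers follow the general LSID shape urn:lsid:authority:namespace:object. The function name, the authority "example.org" and the provider namespace are placeholders, not assigned values.

```python
def mint_lsid(authority, provider_namespace, local_id):
    """Build a global identifier from a provider namespace and a provider-local ID."""
    for part in (authority, provider_namespace, local_id):
        # Colons separate LSID components, so they cannot appear inside one.
        if not part or ":" in part or any(ch.isspace() for ch in part):
            raise ValueError(f"invalid LSID component: {part!r}")
    return f"urn:lsid:{authority}:{provider_namespace}:{local_id}"


# Placeholder values, for illustration only.
example_id = mint_lsid("example.org", "provider-a", "12345")
```

Because the provider namespace is part of the identifier, two providers can reuse the same local ID without colliding.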


Tasks

Dec 11

  • M- Fix c11node22
  • A- Swift -> use 5 (6) nodes on c11
  • A- Riak -> install on 5 (6) nodes on c11

Dec 17

  • M+K+J- Sample dataset -> Ask Kate
  • A+M- Select iDigBioCore from DWC + extensions
  • M- Push sample data in + convert to GeoJSON
  • A- Experiment/design some Riak queries - check performance, pick indexes
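
A minimal sketch of the "convert to GeoJSON" task, assuming sample records arrive as dicts with the Darwin Core terms decimalLatitude and decimalLongitude. The function names and the sample record are made up for illustration.

```python
import json


def record_to_feature(record):
    """Convert one occurrence record into a GeoJSON Point Feature."""
    lon = float(record["decimalLongitude"])
    lat = float(record["decimalLatitude"])
    # Everything except the coordinates is carried along as feature properties.
    properties = {
        key: value
        for key, value in record.items()
        if key not in ("decimalLatitude", "decimalLongitude")
    }
    return {
        "type": "Feature",
        # GeoJSON positions are ordered [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def records_to_feature_collection(records):
    return {
        "type": "FeatureCollection",
        "features": [record_to_feature(r) for r in records],
    }


# Made-up sample record, for illustration only.
sample = [
    {"catalogNumber": "X-1", "decimalLatitude": "29.65", "decimalLongitude": "-82.32"},
]
collection = records_to_feature_collection(sample)
geojson_text = json.dumps(collection)
```

The longitude-first ordering is the easiest thing to get wrong here, since most source data lists latitude first.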

Dec 24

  • A+M- Pick facets for search
  • A- Faceted web search, output list
  • A- GeoJSON + Polymaps
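
A sketch of what the faceted search output could compute, assuming records are dicts and facets are plain field names. The field names and sample values are illustrative only. A real implementation would push this counting into the search backend rather than looping in Python.

```python
from collections import Counter


def facet_counts(records, facet_fields):
    """Count how many records carry each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            value = record.get(field)
            if value:  # skip missing or empty values
                counts[field][value] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"family": "Fagaceae", "country": "US"},
    {"family": "Fagaceae", "country": "MX"},
    {"family": "Pinaceae", "country": "US"},
    {"family": "", "country": "US"},
]
facets = facet_counts(records, ["family", "country"])
```

Calling `facets["country"].most_common()` then gives the value list for a facet sidebar, ordered by hit count.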

Future Sprints

  • Does it scale?
  • Get more (good) data
  • Get more (bad) data
  • 3rd party API (low level)
  • API scale & access
  • Peering