Internal:Scrum

Planning Meeting #1 11/29/11

(Whiteboard diagram: projects:idigbio:meetings:diag.jpg)

Problems

Problems with Storage

  • Full text search at scale - use Riak?
  • File system - cost, scale, access via web -> use an object store (Swift)
  • Efficient text store - divide data into text and objects
  • Too big/unresilient for one location
  • Federation & replication are hard - Swift has only partial replication; Riak has it, but $$
  • "Backup" and "Archive"
  • Mapping many:many between images and specimens (see sketch after this list)
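
A minimal sketch of one way to hold the many:many image/specimen mapping in a key-value layout (e.g. Riak buckets). Plain Python dicts stand in for the store here, and the GUIDs and helper name are invented for illustration.

  # Hypothetical sketch: keep the many:many image/specimen mapping as a
  # pair of forward/reverse index sets keyed by GUID. Plain dicts stand
  # in for key-value buckets (e.g. Riak); all names are illustrative.
  from collections import defaultdict

  images_by_specimen = defaultdict(set)   # specimen GUID -> image GUIDs
  specimens_by_image = defaultdict(set)   # image GUID -> specimen GUIDs

  def link(specimen_guid, image_guid):
      """Record that an image depicts a specimen (and vice versa)."""
      images_by_specimen[specimen_guid].add(image_guid)
      specimens_by_image[image_guid].add(specimen_guid)

  # Example: one sheet image holding two specimens.
  link("urn:uuid:spec-1", "urn:uuid:img-a")
  link("urn:uuid:spec-2", "urn:uuid:img-a")
  print(specimens_by_image["urn:uuid:img-a"])   # both specimen GUIDs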

Problems with Local Data Processing

  • Iteration / MapReduce performance (Riak may support natively) - see sketch after this list
  • API programming ease
  • File system support
  • Access control / Metering / Monitoring & Policy
  • Appliance vs. service vs. VM
  • Port existing tools to run on our system
  • Download results vs update iDigBio
  • Image processing
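
A minimal sketch of the iteration / MapReduce pattern in question, written in plain Python rather than Riak's native MapReduce (which would push both phases out to the storage nodes). The sample records and the count-by-country job are assumptions for illustration.

  # Map phase emits (key, 1) pairs per record; reduce phase sums them.
  # Sample records and the count-by-country job are illustrative only.
  from collections import Counter

  records = [
      {"scientificName": "Quercus alba", "country": "US"},
      {"scientificName": "Quercus rubra", "country": "US"},
      {"scientificName": "Acer saccharum", "country": "CA"},
  ]

  def map_phase(record):
      # Emit one count per country (unknowns bucketed together).
      yield record.get("country", "UNKNOWN"), 1

  def reduce_phase(pairs):
      totals = Counter()
      for key, value in pairs:
          totals[key] += value
      return totals

  emitted = (pair for rec in records for pair in map_phase(rec))
  print(reduce_phase(emitted))   # Counter({'US': 2, 'CA': 1})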

Problems with Data Exposure

  • Large requests, e.g. results for "US" - metering, rate limiting
  • Formats - JSON, XML, CSV - and hierarchical data
  • Programmatic access efficiency / latency - "for r in set do" (see sketch after this list)
  • API bindings for the languages in use
  • Usage tracking
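
A minimal sketch of paged programmatic access for the "for r in set do" case, so a large request such as every record for "US" never arrives as one response. The endpoint URL, query parameters, and JSON response shape are hypothetical, not an agreed API.

  # Hypothetical paged access: endpoint, parameters, and response shape
  # are assumptions; the pattern is the point (paging + client-side delay).
  import json
  import time
  import urllib.parse
  import urllib.request

  def iter_records(base_url, query, page_size=100, delay=0.1):
      """Yield records one page at a time instead of one huge response."""
      offset = 0
      while True:
          params = urllib.parse.urlencode(
              {"q": query, "limit": page_size, "offset": offset})
          with urllib.request.urlopen(base_url + "?" + params) as resp:
              page = json.load(resp)
          items = page.get("items", [])
          if not items:
              return
          for record in items:
              yield record
          offset += len(items)
          time.sleep(delay)   # crude client-side rate limiting

  # for r in iter_records("http://api.example.org/records", "country:US"):
  #     process(r)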

Problems with Portal

  • Visualization depends on geolocation and base mapping layers
  • Full text/faceted search performance
  • Taxon matching needs high quality name resolution service
  • Comparison to existing portals
  • Web design quality
  • Typical software feature requests and bug reports from users - bugs go to internal Redmine (poor auth integration)
  • Feedback, usage tracking

Problems with Peers and Partners

  • How much data do peers get?
  • Sharding and reassembly between peers - force full copy of metadata, shard objects - best to have one place with all images but allow peers to mirror sets
  • Replication protocols - OAI-PMH (harvest sketch after this list)
  • Multi-master updates
  • Data provenance and residency tracking
  • Peer training and technology skills to run our stack - or packaging for simplicity
  • Usage tracking of remote data access
  • Peer storage of object versions
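
A minimal sketch of harvesting a peer's records over OAI-PMH (ListRecords with resumption tokens). The repository URL is a placeholder.

  # Pull records from a peer via OAI-PMH ListRecords, following
  # resumption tokens until the set is exhausted. URL is a placeholder.
  import urllib.parse
  import urllib.request
  import xml.etree.ElementTree as ET

  OAI = "{http://www.openarchives.org/OAI/2.0/}"

  def harvest(base_url, metadata_prefix="oai_dc"):
      params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
      while True:
          url = base_url + "?" + urllib.parse.urlencode(params)
          with urllib.request.urlopen(url) as resp:
              root = ET.fromstring(resp.read())
          for record in root.iter(OAI + "record"):
              yield record
          token = root.find(".//" + OAI + "resumptionToken")
          if token is None or not (token.text or "").strip():
              return
          params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

  # for rec in harvest("http://peer.example.org/oai"):
  #     ingest(rec)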

Problems with Ingestion

(bad data)

  • Field mapping - standardize on Darwin Core, Audubon Core, etc. (see sketch after this list)
  • Taxon name -> LSID?....
  • Georeferencing, provided, data importation, quality check
  • Outlier detection/correction
  • Staging/preview area assist with above
  • Whole data set versus updates -> frequency of updates, DwC Archive, TAPIR
  • Human-in-the-loop vs. bulk
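
A minimal sketch of mapping provider column names onto Darwin Core terms at staging time. The provider field names on the left are invented, and this mapping is illustrative, not the agreed iDigBioCore field set.

  # Provider-field -> Darwin Core mapping applied at staging time.
  # Left-hand names are invented; right-hand terms are standard Darwin Core.
  FIELD_MAP = {
      "sci_name":  "scientificName",
      "lat":       "decimalLatitude",
      "lon":       "decimalLongitude",
      "coll_date": "eventDate",
      "cat_no":    "catalogNumber",
      "inst":      "institutionCode",
  }

  def to_dwc(raw_row):
      """Return a Darwin Core record; unmapped fields go to dynamicProperties."""
      mapped, leftover = {}, {}
      for field, value in raw_row.items():
          if field in FIELD_MAP:
              mapped[FIELD_MAP[field]] = value
          else:
              leftover[field] = value
      if leftover:
          mapped["dynamicProperties"] = leftover
      return mapped

  print(to_dwc({"sci_name": "Quercus alba", "lat": "29.65",
                "lon": "-82.32", "drawer": "14B"}))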

(good data)

  • Specimen / Occurrence / Image ID [Local] -> GUID/URI (assign an LSID range to each provider) [Global] - see sketch after this list
  • Provenance Tracking - Collection, TCN, Uploader, Residency
  • Versioning - overwrite vs append
  • Required field set - GBIF minimum, not images
  • Accepted protocols - TAPIR, DwC Archive, OAI-PMH, native to app, CSV, XLS, SQL
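
A minimal sketch of minting a global GUID/URI from a provider prefix plus the local specimen/occurrence/image ID, and stamping basic provenance and a version number. The URI scheme and field names are illustrative assumptions, not a decided format.

  # Mint a global identifier from provider prefix + local ID and wrap the
  # record with provenance; scheme and field names are illustrative only.
  import datetime

  def mint_guid(provider_prefix, local_id):
      # e.g. ("urn:lsid:example.org:flmnh", "UF-12345")
      #   -> "urn:lsid:example.org:flmnh:UF-12345"
      return provider_prefix + ":" + str(local_id)

  def wrap_record(record, provider_prefix, local_id, collection, tcn, uploader):
      return {
          "guid": mint_guid(provider_prefix, local_id),
          "data": record,
          "provenance": {
              "collection": collection,
              "tcn": tcn,
              "uploader": uploader,
              "uploaded_at": datetime.datetime.utcnow().isoformat() + "Z",
          },
          "version": 1,   # append new versions rather than overwriting
      }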


Tasks

Dec 11

  • M- Fix c11node22
  • A- Swift -> use 5 (6) nodes on c11
  • A- Riak -> install on 5 (6) nodes on c11

Dec 17

  • M+K+J- Sample dataset -> Ask Kate
  • A+M- Select iDigBioCore from DWC + extensions
  • M- Push sample data in + convert to GeoJSON (see sketch after this list)
  • A- Experiment with/design some Riak queries - check performance, pick indexes
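
A minimal sketch of converting a Darwin Core record into a GeoJSON Feature for the task above; the sample record is made up. Note that GeoJSON orders point coordinates as [longitude, latitude].

  # Convert a Darwin Core record into a GeoJSON Point Feature.
  # The sample record is made up; the field names are standard Darwin Core.
  import json

  def dwc_to_geojson(record):
      return {
          "type": "Feature",
          "geometry": {
              "type": "Point",
              "coordinates": [float(record["decimalLongitude"]),
                              float(record["decimalLatitude"])],
          },
          "properties": {k: v for k, v in record.items()
                         if k not in ("decimalLatitude", "decimalLongitude")},
      }

  sample = {"scientificName": "Quercus alba", "decimalLatitude": "29.65",
            "decimalLongitude": "-82.32", "country": "US"}
  print(json.dumps(dwc_to_geojson(sample)))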

Dec 24

  • A+M- Pick facets for search
  • A- Faceted web search, output list (facet-count sketch after this list)
  • A- GeoJSON + Polymaps
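
A minimal sketch of computing facet counts over a result set for the faceted-search output list; the candidate facet fields (country, genus) are assumptions, not the chosen set.

  # Count facet values over a result set; facet fields are candidates only.
  from collections import Counter, defaultdict

  FACET_FIELDS = ["country", "genus"]

  def facet_counts(records):
      counts = defaultdict(Counter)
      for rec in records:
          for field in FACET_FIELDS:
              if field in rec:
                  counts[field][rec[field]] += 1
      return counts

  results = [
      {"scientificName": "Quercus alba", "genus": "Quercus", "country": "US"},
      {"scientificName": "Acer saccharum", "genus": "Acer", "country": "CA"},
      {"scientificName": "Quercus rubra", "genus": "Quercus", "country": "US"},
  ]
  for field, counter in facet_counts(results).items():
      print(field, dict(counter))   # e.g. country {'US': 2, 'CA': 1}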

Future Sprints

  • Does it scale?
  • Get more (good) data
  • Get more (bad) data
  • 3rd party API (low level)
  • API scale & access
  • Peering