== Planning Meeting #1 11/29/11 ==
[[Internal:Scrum:Planning 20111129 | Planning Meeting #1 11/29/11]]
 
''(Planning meeting diagram: diag.jpg)''
 
=== Problems ===
 
==== Problems with Storage ====
 
* Full text search at scale - use Riak?
* File system - cost, scale, access via web -> use object store, Swift (upload sketch after this list)
* Efficient text store - divide data into text and objects
* Too big/unresilient for a single location
* Federation & replication are hard - Swift sort of has replication; Riak does, but at a cost ($$)
* "Backup" and "Archive"
* Mapping many:many between images and specimens
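
A minimal sketch of the object-store direction, assuming python-swiftclient and a reachable Swift auth endpoint; the URL, credentials, container name, and the <code>X-Object-Meta-Specimens</code> header are illustrative, not a settled convention. It stores one image and tags it with the specimen GUIDs it depicts, which is one cheap way to carry the many:many image/specimen link alongside the object.

<syntaxhighlight lang="python">
# Sketch: store a specimen image in Swift and tag it with the specimen GUIDs
# it depicts. Auth URL, credentials, container, and metadata header are
# placeholders for illustration only.
from swiftclient import client as swift

conn = swift.Connection(
    authurl="http://swift.example.org/auth/v1.0",  # placeholder auth endpoint
    user="idigbio:admin",                          # placeholder credentials
    key="secret",
)

conn.put_container("images")
with open("FLAS-12345.jpg", "rb") as f:
    conn.put_object(
        "images",
        "FLAS-12345.jpg",
        contents=f,
        content_type="image/jpeg",
        # one image may depict several specimens (many:many); a custom
        # object-metadata header keeps the link recoverable from the object side
        headers={"X-Object-Meta-Specimens": "urn:uuid:aaa...,urn:uuid:bbb..."},
    )
</syntaxhighlight>

The authoritative many:many mapping would still live in the metadata store (e.g. Riak); the header only keeps the link visible next to the object itself.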
 
==== Problems with Local Data Processing ====
 
* Iteration / MapReduce performance (Riak may support natively; MapReduce sketch after this list)
** API programming ease
** File system support
* Access control / Metering / Monitoring & Policy
* Appliance vs service vs vm
* Port existing tools to run on our system
* Download results vs update iDigBio
* Image processing
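
A sketch of the kind of iteration we would want to push down to the store, assuming Riak's HTTP MapReduce endpoint (<code>/mapred</code>) and its built-in JavaScript reduce; host, port, and bucket name are placeholders. It counts the objects in a bucket, which is about the cheapest full-bucket pass and so a useful performance probe.

<syntaxhighlight lang="python">
# Sketch: count objects in a Riak bucket by posting a JavaScript MapReduce job
# to the HTTP /mapred endpoint. Host, port, and bucket name are placeholders.
import json
import urllib.request

job = {
    "inputs": "specimens",  # full-bucket input: walks every key, slow at scale
    "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [1]; }"}},   # emit 1 per object
        {"reduce": {"language": "javascript",
                    "name": "Riak.reduceSum"}},                # built-in sum reduce
    ],
}

req = urllib.request.Request(
    "http://riak.example.org:8098/mapred",
    data=json.dumps(job).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode("utf-8")))  # e.g. [12345]
</syntaxhighlight>

Because full-bucket MapReduce touches every key, timing this job doubles as a check on how badly iteration hurts as the dataset grows.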
 
==== Problems with Data Exposure ====
 
* Large requests, e.g. all results for "US" - metering, rate limiting
* Formats - JSON, XML, CSV - and hierarchical data
* Programmatic access efficiency / latency - "for r in set do" style iteration (sketch after this list)
* API bindings for commonly used languages
* Usage tracking
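
A sketch of the "for r in set do" access pattern against a hypothetical paged JSON endpoint (the URL and the q/limit/offset parameters are made up for illustration): a generator fetches one page at a time so client code can iterate lazily, and a small delay keeps it under an eventual rate limit.

<syntaxhighlight lang="python">
# Sketch: page through a (hypothetical) search endpoint so client code can write
# "for r in records(...)" without pulling the whole result set in one request.
import json
import time
import urllib.parse
import urllib.request

def records(base_url, query, page_size=1000, delay=0.5):
    offset = 0
    while True:
        url = "%s?%s" % (base_url, urllib.parse.urlencode(
            {"q": query, "limit": page_size, "offset": offset}))
        with urllib.request.urlopen(url) as resp:
            page = json.loads(resp.read().decode("utf-8"))
        items = page.get("items", [])
        if not items:
            return
        for item in items:
            yield item
        offset += len(items)
        time.sleep(delay)  # stay under the (eventual) rate limit

# e.g.: for r in records("http://api.example.org/search", "country:US"): ...
</syntaxhighlight>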
 
==== Problems with Portal ====
 
* Visualization depends on geolocation and base mapping layers
* Full text/faceted search performance
* Taxon matching needs high quality name resolution service
* Comparison to existing portals
* Web design quality
* Typical software feature requests and bug reports from users - bugs -> internal Redmine (poor auth integration)
* Feedback, usage tracking
 
==== Problems with Peers and Partners ====
 
* How much data do peers get?
* Sharding and reassembly between peers - force full copy of metadata, shard objects - best to have one place with all images but allow peers to mirror sets
* Replication protocols - OAI-PMH (harvest sketch after this list)
** multimaster updates
** data provenance and residency tracking
* Peer training and technology skills to run our stack - or packaging for simplicity
* Usage tracking of remote data access
* Peer storage of object versions
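
A sketch of metadata replication over OAI-PMH, the protocol named above: ListRecords plus resumptionToken paging, with the peer repository URL as a placeholder and oai_dc as the lowest-common-denominator metadata format. Object replication (the images themselves) would still need something else.

<syntaxhighlight lang="python">
# Sketch: harvest metadata from a peer via OAI-PMH (ListRecords + resumptionToken).
# The repository URL is a placeholder.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        for record in root.iter(OAI + "record"):
            yield record
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return
        # subsequent requests carry only the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# e.g.: for rec in harvest("http://peer.example.org/oai"): ...
</syntaxhighlight>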
 
==== Problems with Ingestion ====
 
(bad data)
* Field mapping - standardize on Darwin Core, Audubon Core, etc. (mapping sketch after this list)
* Taxon name -> LSID? ...
* Georeferencing, provided, data importation, quality check
* Outlier detection/correction
* Staging/preview area assist with above
* Whole data set versus updates -> frequency of updates, DwC Archive, TAPIR
* Human-in-loop vs Bulk
(good data)
* Specimen / Occurrence / Image ID [Local] -> GUID/URI (assign LSID range to each provider) [Global]
* Provenance Tracking - Collection, TCN, Uploader, Residency
* Versioning - overwrite vs append
* Required field set - GBIF minimum, not images
* Accepted protocols - TAPIR, DwC Archive, OAI-PMH, native to app, CSV, XLS, SQL
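
A sketch of the field-mapping plus GUID-assignment step. The provider's raw column names and the <code>urn:idigbio:...</code> identifier scheme are illustrative stand-ins for whatever LSID/GUID range actually gets assigned per provider; the Darwin Core term names themselves are real terms.

<syntaxhighlight lang="python">
# Sketch: map a provider's raw column names onto Darwin Core terms and mint a
# global identifier from the provider prefix plus the local catalog number.
# Column names and the URI scheme are illustrative only.
DWC_MAP = {
    "sci_name": "scientificName",
    "lat":      "decimalLatitude",
    "lon":      "decimalLongitude",
    "cat_no":   "catalogNumber",
    "inst":     "institutionCode",
}

def to_dwc(raw_row, provider_prefix):
    rec = {DWC_MAP[k]: v for k, v in raw_row.items() if k in DWC_MAP}
    # local ID -> global URI: one stable namespace per provider
    rec["occurrenceID"] = "urn:idigbio:%s:%s" % (provider_prefix, raw_row["cat_no"])
    return rec

row = {"sci_name": "Quercus alba", "lat": "29.64", "lon": "-82.35",
       "cat_no": "FLAS-12345", "inst": "FLAS"}
print(to_dwc(row, "flas"))
</syntaxhighlight>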
 
 
=== Tasks ===
 
Dec 11
* M- Fix c11node22
* A- Swift -> use 5 (6) nodes on c11
* A- Riak -> install on 5 (6) nodes on c11
 
Dec 17
* M+K+J- Sample dataset -> Ask Kate
* A+M- Select iDigBioCore from DWC + extensions
* M- Push sample data in + convert to GeoJSON (conversion sketch below)
* A- Experiment/design some Riak queries - check performance, pick indexes
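
A sketch of the sample-data-to-GeoJSON conversion, assuming Darwin Core-style field names on the input records (illustrative sample data); note GeoJSON puts longitude before latitude.

<syntaxhighlight lang="python">
# Sketch: turn Darwin Core-style records into a GeoJSON FeatureCollection.
# Field names and the sample record are illustrative.
import json

sample_records = [
    {"scientificName": "Quercus alba", "decimalLatitude": "29.64",
     "decimalLongitude": "-82.35", "occurrenceID": "urn:idigbio:flas:FLAS-12345"},
]

def to_feature(rec):
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON coordinate order is [longitude, latitude]
            "coordinates": [float(rec["decimalLongitude"]),
                            float(rec["decimalLatitude"])],
        },
        "properties": {
            "scientificName": rec.get("scientificName"),
            "occurrenceID": rec.get("occurrenceID"),
        },
    }

collection = {"type": "FeatureCollection",
              "features": [to_feature(r) for r in sample_records]}
print(json.dumps(collection, indent=2))
</syntaxhighlight>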
 
Dec 24
* A+M- Pick facets for search
* A- Faceted web search, output list (facet-count sketch below)
* A- GeoJSON + Polymaps
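
A sketch of what "pick facets" means in practice: count distinct values per candidate field over the sample records. Done client-side here with <code>collections.Counter</code> for clarity; in the real portal the counts would come from the search index, not from Python.

<syntaxhighlight lang="python">
# Sketch: candidate-facet value counts over sample records, computed client-side.
# In production these counts would come from the search index instead.
from collections import Counter

def facet_counts(records, fields):
    return {f: Counter(r.get(f, "(missing)") for r in records) for f in fields}

sample = [
    {"institutionCode": "FLAS", "country": "US", "family": "Fagaceae"},
    {"institutionCode": "FLAS", "country": "US", "family": "Pinaceae"},
    {"institutionCode": "NYBG", "country": "MX", "family": "Fagaceae"},
]
for field, counts in facet_counts(sample, ["institutionCode", "country", "family"]).items():
    print(field, dict(counts))
</syntaxhighlight>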
 
=== Future Sprints ===
 
* Does it scale?
* Get more (good) data
* Get more (bad) data
* 3rd party API (low level)
* API scale & access
* Peering
