Importing Data

From ARK
Revision as of 20:19, 4 December 2012 by Guy (talk | contribs)

Jump to: navigation, search

As of the v1.0 release of ARK there is a set of data import tools within ARK. These tools speed up and improve the process of creating a concordance map between the source data and an ARK database. Never the less, using the tools will require a working knowledge of ARK's data structures. If you are unable to import your data or would like training in the ARK import tools, please contact L - P : Archaeology who provide both training and custom ARK installations.

Overview

The ARK import tools are designed to provide a means to import data from table based data. This can be either in the form of a series of tables from a relational database or from a spreadsheet. The tools map the fields in the data onto fields within the ARK data structure. Data can then be inspected before running an import.

The tools provide some join functionality and are intended to reduce the need for pre-processing of data where possible. Inevitably, there will be some need to pre-process data for import.

Concordance Maps

Each instance of ARK can contain multiple concordance maps. The concordance map holds the information (mappings) that explains to the import tools how a particular element of the data should be reworked for import to the ARK structure. The concordance map maps both 'structure' and 'data'.

Structure Mapping

This is the key part of each concordance map, it contains the mappings for each field in the source data that will be imported into the ARK database. Due to the way ARK data is structured, it is often the case that many source data fields are not needed for import.

Once mapped, a field can be imported and re-imported using the same mapping. Future version of ARK may support this as a method for dynamic updates of the data over the web.

To use a simple example, a field in the source data such as 'Description' which is a column in a source database, will be mapped to an ARK txttype ready for import.

Data

Data mapping maps certain terms or values in a source database to certain terms or values in the ARK database. For example, this permits a term such as 'castle' in a source database to be mapped to a term such as 'defensive structure' in a target database.

When importing from controlled lists, ARK makes new data mappings on the fly as it finds each new term within the list.

By manually creating a data mapping, it is possible to alter the way this data is imported without the need for pre-processing. This is especially important in the case of dynamically re-importing data.

How to import data

This gives an overview of how to import data into ARK using the built in import tools.

Source Data

The first task is to take a close look at the data you are about to import. Problems in the data will not be magically corrected by the import tools. If you put junk in you will get junk out.

At this stage the tools will only read data in from a source database on the same MySQL server. There is no reason you couldn't adapt the tools to easily work with another database such as Postgres, but at this stage, the tools will not read in data external to the server. The first task is therefore to import your data to the server. You can either create tables on the target database or set up a separate 'source database'.

The easiest way to get data onto the server is probably to save each table as a comma separated values file and import this to the server using phpMyAdmin.

Be sure to prefix source database tables with "import_" this will indicate to the import tools that you wish to map data in this table. Look up tables and other supplementary tables which will not be directly imported do not need this prefix.

As a rule, you should name your columns with sensible names, avoid fancy characters and so on. Try to make the names unique and memorable.

It is often simplest to create the table on the database first and then simply import the data into this table (and its defined columns), although this is up to you.

Once you have imported your source data to the server, it is time to start mapping your databases.

  • UID Column - Any table that you prefix 'import_' and intend to import data from MUST have a column which contains a unique ID for each row of the table. This is used to loop over the data by the tools, it is not imported in anyway, but it is essential. This can in fact be a column containing your 'key' data, although in certain circumstances it is desirable to create (manually) and new UID column on the source table specifically for the purpose of importing the data. See below for further information.

Concordance Map - CMAP

If you have not done so already, you will need to create a new Concordance Map. To do this, use the tool available on the left panel fo the import tools home page.

Fill in the required fields carefully (check spellings!) and save the new map to the database.

  • Nickname - A mnemonic, this can be anything you like that will help you remember the CMAP. Do NOT use spaces or funny characters here.
  • Description - A text field that you can use to describe this CMAP for future reference. This will accept UTF-8 characters etc.
  • Source DB - The exact name of the source database on this server. (May be the same as the ARK db)
  • Target Site Code - This is a default site code for the import. More complex options for this can be set in the structure map (which override this setting).

You can edit this map afterwards using the built in edit CMAP tool.

Structure Map - CMAP Structure

Once you have set up a CMAP, you can begin mapping fields in the source DB. The key thing here is to treat each and every field (column) in the source data s an independent entity. The import tools look only at a single column, they do not try to recreate your relational database beyond a few specific functions.

The recommended method is to look at each column in turn to decide if it will be imported into ARK or not. This is not as simple as it sounds so take time to analyse each column.

In order to set up each field for import, use the tools provided. On the left panel of the import pages, select 'Edit Structure Map'. The tools provided list out all the 'import_' tables available on this CMAP in the right hand panel. In order to start working simple click the edit icon next to the table.

This will then display each of the fields within this table that are available for import. Fields that are already mapped will be indicated as such. Each field may only be mapped once within a given CMAP, but could be mapped from multiple CMAPs if needed.

In order to create a new mapping click the edit icon for the field. The tool will ask a couple of simple questions about the field before presenting a form.

  • Class - The first question asks what type of import the tools should run on this data. These classes mirror (but do NOT match) the ARK dataclasses.
  • Raw Item Val - The next question aims to establish the location of the raw item val. This is a key concept in the import tools (see notes). This question aims to establish whether a join (to a secondary table) will be necessary to arrive at this information. in the simplest case, this information is in a column on the import table itself.
  • Site Code - The system provides three options for the site code to be used in the import of each field. The simplest option is just to use the fixed site code specified in the CMAP. The second option is to use a code specified in a column of this table which will permit data from multiple sites to be imported from this field. Thirdly there is an option to get a site code from a joined table.

In addition, the value 'notset' is available for most import types:

  • notset - This provides a way to specify records that should not be imported to the database. For example, you might use a keyword BLANK or SKIP in your source data as a way to prevent certain rows from being imported. Be aware that you should carefully check over the data to be imported in the extraction test.

Keys

This class is used to import the keys into the ARK database. This tool will create the itemvalues for this module. Therefore this information needs only to be imported once. Effectively this import will 'register' all of the new records to the ARK database. In ARK there are modules that use modtypes and those that don't. The import tools can import keys for modules with or without keys.

  • Modkey - This imports keys to a module using modtypes. It does NOT import the modtype data. It will leave this column blank in the table. The modtype information MUST be imported to avoid nasty errors.
  • Key - This imports keys for modules not using modtypes.

The data in THIS field/column will be turned into the ARK key by appending the site code to the front of it. You need to ensure that this data would be suitable to be used as an ARK key. It must not contain underscores for example. Duplicated numbers int his column will also cause the import to crash.

The form will only ask for three things:

  • UID Column - see notes
  • Itemkey - The itemkey for which these values will become ARK itemvalues (defines the module)
  • not set - see notes

Boolean TRUE|FALSE|unset (attribute A)

This kind of true/false data is imported to ARK as attributes. In this case it is presupposed that every row will be imported.

  • true - Use this to set up which values will be turned into a logical positive in ARK
  • false - Use this to set up which values will be turned into a logical negative in ARK

Before running the import, you should make sure that the attribute itself exists.

Note that in ARK booleans are represented as 0 and 1 but that this may be aliased to whatever you like.

Controlled Lists (attribute B)

Data from controlled lists (aka 'look up tables', aka 'drop down menus') will be imported into ARK as class "attribute". For import purposes (only) there are two types of attribute. Type B attributes are of the kind where a series of terms in a controlled list are in play, these may or may not be applied to the item. It is NOT presupposed that a value must be set for every item. Multiple entries per item are also permitted (each one must be on its own unique row int he source data).

Before beginning the import, you must establish the attributetype of the attribute in question, if not present, this must be added before attempting to create the mapping.

Any attributes themselves may be manually set up before import or will be automatically added (and aliased) by the function if they are not already present int he controlled list for the specified attribute type.

Note: All boolean values for any attributes added will be forced to TRUE (1) in the import.

Numbers, Text, Dates

These three types of data are simply imported as is. The previous comments on raw itemvals and 'notset' apply

Spans

Note that you must have manually set up the span type.

Spans rely on having raw data in both the beginning field and end field. This data must be pre-processed as it will be entered into ark AS IS.

Only add a mapping for the 'beginning' of spans. In this mapping the column containing the 'end' data is specified as 'end_source_col'. Do not map 'ends' as this will duplicate entry

Action

At present there is no user form for adding actions to the CMAP

Links to other modules - XMI

Specify the XMI itemkey and xmi itemval column for the XMI links.

Extraction Tests and Execution

Once you have registered a mapping into the database, this can be tested. The test will do a dry run on the import and show you a table of results that show the exact values to be added to the database for this extraction from the source db. It is vital that you very closely check this data to ensure that you have got your set up correct.

In addition, check for duplicates or otherwise mangled data.

Undoing an incorrectly specified import will require use of the phpMyAdmin tool and will require knowledge of SQL.

Bear in mind that in the case of Attributes of type B, the preview will most likely show a series of SQL statements indicating the additions to the look up table (and Aliases) that would be run.