Import CITI Program data

The Collaborative Institutional Training Initiative (CITI Program) provides peer-reviewed, web-based educational courses in research, ethics, regulatory oversight, responsible conduct of research, research administration, and other topics pertinent to the interests of member organizations and individual learners. Many universities use CITI for their training.

A CITI user with administrative access is able to download training data for their institution. This download is very simple.

Importing this data can be challenging.

For my task, I had to import the CITI data into a MySQL database. I got data truncation errors on columns when the mysql import found character sequences it did not like. 

Warning 1366    Incorrect string value: ‘\xE1ndez …’ for column ‘Name’ at row 1219

The value it hiccuped on in the Name column is Flores Hernández.

Reopening the CITI data file using Sublime Text 3 with various encodings seemed to show that its encoding is Windows 1252.

 In particular, I saw accented Spanish vowel characters in the CITI file. 

I attempted to use the file command to identify the encoding, but my version of file was not up to the task. It only looked at the first portion of the file and my special characters were much deeper. Sublime Text offered some insight. Notepad++ would also work.

This command worked to convert the import file to “plain” ascii.

iconv -f windows-1252 -t ascii//translit citi-data.csv > converted.csv

Although losing the Spanish vowels was not ideal, it was valuable because I needed to process the import file using sed and sed just isn’t able to handle cool characters. At least my version isn’t.

After some reading, I found that the Windows 1252 character set is latin1 to MySQL. This meant that if I skipped sed, I could import directly into MySQL with no column truncations by using the correct character set. I just had to connect the dots that Windows 1252, or cp1252, is latin1.

So a MySQL import in a bash script like this works:

CS="CHARACTER SET latin1"
$MYSQL --local-infile -h localhost -u $DBUSER -p$DBPASS $DBNAME --execute="LOAD DATA LOCAL INFILE '$INFILECSV' INTO TABLE $TABLE $CS FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 2 LINES; SHOW WARNINGS" >> $WORKDIR/$TABLE.log

Ignoring two lines was required because the first two lines of the CITI data are the date and the column headers.

2/19/2020
member ID number,First Name,Last Name,email address,Registration Date,name of curriculum (i.e. Human Research),Group,Stage Number,stage description (i.e. Basic Course),completion report number,date completion report was earned,learner’s score,Passing score,date completion report expires,group ID number,Name,Username

CITI Import Table

Here’s the table I’m using to import CITI data.

drop table citiimport;
create table if not exists citiimport (
  memberid int,
  name_given varchar(60),
  name_family varchar(60),
  email varchar(200),
  registrationdatetxt varchar(10),
  curriculum varchar(160),
  groupname varchar(160),
  stagenumber int,
  stagedescription varchar(40),
  compreportnumber int,
  earneddatetxt varchar(25),
  score int,
  passingscore int,
  expirationdatetxt varchar(25),
  groupid int,
  fullname varchar(120),
  username varchar(100)
) engine=innodb
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
;

Dates

I have seen earned and expiration dates like

11/27/2017

8/13/2018 7:05:01 PM

I have not seen 24-hour times for these dates.

The registration date has consistently looked like

09/05/12

References

https://coderwall.com/p/gsdx7w/importing-latin1-data-to-mysql-is-being-truncated

https://dev.mysql.com/doc/refman/8.0/en/load-data.html