Programming Process (Coming Soon)
Jane has helped me install VuFind on WiderNet’s mumbai server, and with Brent’s great help in identifying and solving software problems, we finally uploaded the Project Gutenberg records to run a VuFind test!
Next week, Cliff might be able to help me transform eGranary’s catalog records into MARC records, because VuFind is “MARCcentric” (Brent’s word). I have consulted Wendy and Jen at DLS about the crosswalk; they were so generous to offer their metadata knowledge to help me!
If I can help eGranary make some real progress, that will be a very rewarding conclusion to my spring semester.
Hans Rosling demonstrates how we can use open data to better understand the world and its problems.
Hans Rosling: Debunking third-world myths with the best stats you’ve ever seen (2006)
Hans Rosling: New insights on poverty and life around the world (2007)
There are automatic crawlers archiving the web, but the difficult thing is to catalog the archived web. Automatic metadata generation is still not accurate.
A very useful resource: International Web Archiving Workshop
1. For the metadata management system, it’s quite possible that we will use the toolkits that are under development by the XC (eXtensible Catalog) project.
2. For harvesting the existing metadata, we will try OAI-PMH. XC has developed a toolkit that harvests metadata through this protocol, and Internet Archive shares its metadata through it. Project Gutenberg’s metadata has already been shared through Internet Archive’s OAI interface too.
3. How to integrate DMOZ’s metadata is still a problem. I found the Librarians’ Internet Index; however, it does not share its metadata with anyone.
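To get a feel for what an OAI-PMH harvest involves, here is a minimal Python sketch. It is not the XC toolkit, just my own illustration of the protocol: one function builds a ListRecords request URL, and another pulls identifiers and Dublin Core titles out of a response. The function names are hypothetical, and I am assuming the repository serves the standard oai_dc metadata format.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Standard namespaces for OAI-PMH 2.0 responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier",
                                  default="", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means the list is complete.
    return records, (token or None)
```

A harvester would fetch the first URL, parse the response, and keep requesting with the returned resumption token until it comes back empty.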
This is a blog dedicated to recording my journey toward becoming a real librarian.
I am currently a graduate student in the School of Library and Information Science at the University of Iowa. I am also an IMLS digital library fellow, so I am working on a different digital library project each semester along with my coursework.
I hope to use this blog to record what I learn from both my courses in library school and my experience in digital library projects. There are so many interesting topics, and I am especially interested in Open Access, the Semantic Web, and many more.
So, come back to check out my posts!
I want to record the several lectures and workshops I attended recently, since they are very interesting to me:
- Professor Richard Furuta is from the Department of Computer Science at Texas A&M University, where he directs the Center for the Study of Digital Libraries. He gave us a lecture on “Digital Libraries and Digital Humanities: Experiences with Research Partnerships among the Liberal Arts, the University Libraries, and Computing.” It is really exciting to see some of the projects he is involved in. Unlike most of the digital library projects here, each of his projects is developed uniquely. Instead of using general online collection management software, those projects are programmed by computing staff, with research scholars in the relevant humanities fields closely involved. This strategy of coordinating three parts of the university (liberal arts, libraries, and computing) takes a lot of time and effort, but it is very rewarding. These collections are rich in content and elegant in design.
Nautical Archaeology Digital Library at Texas A&M University
Center for the Study of Digital Libraries
- I attended an introductory lecture on the CMS Drupal and an introductory training session on Drupal, and I am going to attend another Drupal training session. Jane, a 2007-2008 digital library fellow who is working on the WiderNet project this semester, coordinated this series of workshops. Drupal is open source content management software that anyone can use to set up an online community, which can be very influential in real life. Drupal is hot in the library world because it can be used to realize the Library 2.0 idea. There is a report about this topic: Drupal in Libraries by Andy Austin and Christopher Harris.
- Angela, a 2008-2009 digital library fellow, coordinated an online training session on MODS provided by a technician from the Library of Congress. MODS is an XML schema whose elements are a subset of the MARC fields. It has been widely used to describe online collection content. In addition to learning about MODS, we also had a chance to see some of the LC collections of websites and webpages. It really makes sense to archive websites, especially during special events like “9.11.”
I really enjoyed these eye-opening lectures and workshops. Now, at the end of this semester, I have a much better sense of the digital library world. It is interesting to know that there are different ways to set up a digital library collection. One is to use ready-made content management software like ContentDM, which is very efficient but limits the possible formats and functions. Another is to start from scratch, letting computing staff develop a custom web engine and involving scholars closely to learn their research needs. Content management software like Drupal, which is free, provides a third way: it gives you some room to maneuver in the functions of the site, and its beauty is that it realizes the 2.0 idea of involving users’ input.
I need to update this blog entry with a wonderful presentation:
- Wendy and Ann from DLS gave a presentation on “Digital Libraries and the Systems that run them.” I wish I had attended this presentation at the beginning of this semester, since it is very informative, especially for someone who is new to the area, and even for me now, after working in the area for a while. I like that they talked about the variety of CMS, and especially their list of concerns when evaluating a CMS for local needs:
- “What type of content will you include?
- What features are most important to you?
- How comfortable are you with IT?
- What local support will you have?
- Do you want one piece of software for all things or multiple for “best of breed”?”
They also mentioned more CMS and the sample collections that I heard about for the first time: DigiTool, Luna, Greenstone, CollectiveAccess, Omeka. Wow!
I read the news from LOC’s Digital Preservation Department about the initiative of several federal agencies trying to set guidelines for evaluating image characteristics and establishing metadata elements, which will serve not only the agencies, but also digitization service providers, equipment manufacturers, and other technologists.
When we talked about the federal government poster collection earlier this semester, Marianne from Government Publications mentioned the possible partnership with the federal government on this collection, and she asked Mark whether our digitization would meet the federal standard. Mark glanced at the standard at that time, and it did not seem hard to meet. But I don’t remember whether there were standards about metadata as well. It will be interesting to look at those too.
There seems to be more than one standard or best practice in the digitization field. I want to study and compare them when I have time.
Since I have posted two scanner images before, I will post this third one: a jumbo scanner! Maybe I could have a collection of scanner images soon.
The Federal Government Posters Collection is online now, although it has not been published.
Since I have scanned some posters, I was able to test uploading some of them onto the ContentDM server, and Mark went through the homepage generating process with me after I successfully uploaded 48 poster images. Once I have the images in one folder on a local drive, and a txt metadata file on the local drive with its metadata fields in the same order as the collection metadata schema we set up before, the uploading is not hard at all. The software will match the images and their metadata automatically.
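Before a batch upload like this, it might help to check the metadata file against the schema and the image folder. The Python sketch below is only my own pre-flight check, not anything ContentDM provides. It assumes the txt file is tab-delimited with a header row and one row per image; the function name and field names are hypothetical.

```python
import csv
import io


def check_metadata(text, schema_fields, image_names):
    """Return a list of problems found in a tab-delimited metadata file.

    text          -- contents of the metadata file (assumed tab-delimited)
    schema_fields -- field names in the order of the collection schema
    image_names   -- file names found in the image folder
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["metadata file is empty"]
    header, records = rows[0], rows[1:]
    problems = []
    # The header must list the same fields in the same order as the schema.
    if header != list(schema_fields):
        problems.append(f"field order differs from schema: {header}")
    # One metadata row is expected for each image.
    if len(records) != len(image_names):
        problems.append(
            f"{len(records)} metadata rows but {len(image_names)} images")
    # Every row should have exactly as many fields as the header.
    for i, row in enumerate(records, start=1):
        if len(row) != len(header):
            problems.append(
                f"row {i} has {len(row)} fields, expected {len(header)}")
    return problems
```

An empty list means the file passed these three checks; it does not prove that each row describes the right image.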
By now, I have gone through all the basic steps of building a digital collection: harvesting metadata, digitization, and publishing. Wow! But so much work remains: finalizing the metadata, finishing the digitization, uploading both simple and compound projects, and post-production tasks such as advertising the collection for wider usage … I am not sure whether I will continue this project in the Spring semester or not, but I will do a presentation on it at the end of this semester. So I should begin to think more deeply about what I have learned from it.
54″ Large Format Scanner
Last week, Mark arranged a trip to visit the roll-through scanner on the Oakdale campus, which is going to be used to scan the large posters. Besides Mark and me, the group included people from various departments: Digital Library Services, Government Publications, and the Conservation Department.
This 54-inch-wide scanner is in the Geological Survey Building and has mainly been used to scan large-format maps. We brought several posters for a test scan, and it worked really well in terms of both convenience and quality. The Conservation Department requires us to send the posters to them first to be cleaned and repaired before they are sent to be scanned by the roll-through scanner. They will also make plastic covers in several different sizes to protect the fragile posters while they are scanned.
There is an issue about whether to barcode the posters before they are sent to the Oakdale campus to be scanned, so that we can name the scanned images by their barcodes and the cataloging can be done before the posters are scanned. We are still waiting for the decision from Government Publications.
If the scanning of the large posters cannot be started before the winter break, I might spend one or two weeks during the break working on the scanning.
Although I haven’t finished cleaning up the metadata spreadsheet, Mark said it is very close, and we still need to wait for Jen’s return to finalize the formats for some fields. Although we don’t have perfect metadata, Mark thinks it would be good to practice some simple uploads. He has taught me how to use ContentDM to upload a batch of images with a delimited metadata file. I just need to start scanning the posters!
In a spreadsheet, I have grouped the posters into three groups: approximately 200 small ones, 700 medium ones, and 500 large ones.
I am going to scan the 200 small posters (less than 28cm x 43cm) with the regular flatbed scanner in the DLS project room. Keo showed me how to use the scanner and set up a folder for my scanned images. I scanned 11 posters in the “A” drawer to familiarize myself with the scanner.
I am going to use the top-down scanner to scan the 500 medium-sized posters (less than 63cm x 46cm). I actually finished scanning about 60 posters on Friday morning! The top-down scanner worked really well for the posters, except for one time: when I lifted up the cover glass but did not push it back far enough against the computer monitor, the scanner began to show over-exposed images. But Keo guessed the problem when I went to ask him for help, so we got the scanner back to normal quickly.
We are going to scan the 500 large posters using a roll-through scanner located on the Oakdale campus. Mark arranged a trip to visit and test the scanner next Wednesday. The only unsolved issue is how to transport these large posters. Kristin from the Conservation Department will figure out how to do this.
So, suddenly, the digitization starts and the project begins to progress at a quick pace! Exciting!
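The grouping by size could also be done with a small function instead of by hand in the spreadsheet. This Python sketch uses the two size limits mentioned above (28cm x 43cm and 63cm x 46cm). It assumes that a poster can be rotated to fit, so only the shorter and longer sides matter; that assumption and the function name are mine.

```python
def scanner_for(width_cm, height_cm):
    """Pick a scanner for a poster from its dimensions in centimeters.

    Assumes orientation does not matter: the poster can be rotated,
    so the shorter and longer sides are compared with the limits.
    """
    short_side, long_side = sorted((width_cm, height_cm))
    if short_side < 28 and long_side < 43:
        return "flatbed"       # small posters
    if short_side < 46 and long_side < 63:
        return "top-down"      # medium-sized posters
    return "roll-through"      # large posters


# Example: a poster recorded as 60cm wide and 40cm tall is medium-sized.
print(scanner_for(60, 40))
```

Applied to the width and height columns of the spreadsheet, this would reproduce the three groups automatically.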
The top-down scanner finally cooperated with us! Amber and I scanned four more medieval manuscripts last week.
Amber and I did a comparison between the images scanned from the top-down scanner and those photographed with a digital camera. Because of the high resolution of the scanned images, we decided to use the scanned images.
The scanned image size is large; even the compressed jpg file is over 3MB per image. Slower computers will take longer to download each image. The banding (University of Iowa. Special Collection Dept.) for an ordinary collection image is extremely tiny when added to these large images, and unfortunately there is no way that we can set the banding to be compatible with the large image size. I had thought that we would need to resize the images so that people could see them faster and the banding would look good. My thoughts turned out to be off track. Mark said it is not wise to sacrifice the quality of the images in order to fit the banding. We should always keep the image resolution as high as we can; that’s why we don’t mind that the images in the collection have various resolutions. The scholars who are interested in the medieval manuscripts will be glad to see as much detail as possible. When you zoom in on the images of music book leaves we put on the collection web page this week, you can even see the pores on the parchment! No wonder Mark won’t let me resize the images. I should have made users’ needs my highest priority, not the web page layout or any other concerns.
I really like the merge function in Photoshop. These stitched images look great. We also found that Photoshop Elements 6 is more powerful than Photoshop Elements 5.
So far, we have scanned only leaves using the top-down scanner. Next week we will start to scan books! This scanner’s name is “Book Copier,” which implies that it should work best for copying books. There won’t be heat or pressure on the books, and there is even an adjustable book cradle! I feel that I am going to love this scanner, although there was a lot of trouble in the beginning.
After looking through over a thousand federal government posters and gathering the approximately 450 OCLC records for the posters without InfoHawk records, I finally finished preparing the sources of metadata. From now on, I am in the productive period of the federal government poster project!
Wendy extracted the 450 OCLC records into an XML file, and a person from the IT department helped transfer the information into a spreadsheet. Because one person transferred the previous 1000 records and another transferred the present 450 records, we ended up with two spreadsheets with different headings. Wendy and Ellen tried to decipher the codes of the data, and then Mark and I spent the whole Friday morning combining the two spreadsheets. Ellen also worked out a metadata schema for us, which also serves as a crosswalk between OCLC’s MARC metadata scheme and ContentDM’s conventional metadata scheme. Based on this mapping, I can now clean up the OCLC data to make it ready for ContentDM uploading.
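Combining two spreadsheets with different headings is basically applying a small crosswalk to each one. Here is a Python sketch of the idea. We did the real work by hand in Excel, so this is only an illustration; the column headings and target field names in the example are made up, not the ones in our actual spreadsheets or in Ellen’s schema.

```python
def apply_crosswalk(row, crosswalk):
    """Rename the keys of one record; columns not in the crosswalk are dropped."""
    return {crosswalk[key]: value for key, value in row.items()
            if key in crosswalk}


def merge_sheets(sheets_with_crosswalks, target_fields):
    """Merge several sheets into one list of records with shared headings.

    sheets_with_crosswalks -- list of (rows, crosswalk) pairs, where rows is
                              a list of dicts keyed by that sheet's headings
    target_fields          -- headings of the combined sheet, in order
    """
    merged = []
    for rows, crosswalk in sheets_with_crosswalks:
        for row in rows:
            mapped = apply_crosswalk(row, crosswalk)
            # Fill fields missing from this sheet with an empty string.
            merged.append({field: mapped.get(field, "")
                           for field in target_fields})
    return merged


# Hypothetical headings: one sheet labeled by MARC tag, one by plain names.
sheet_a = [{"245a": "Smokey says", "086a": "A 13.20:X"}]
sheet_b = [{"TITLE": "Plant a tree", "EXTRA": "not needed"}]
combined = merge_sheets(
    [(sheet_a, {"245a": "Title", "086a": "SuDocs number"}),
     (sheet_b, {"TITLE": "Title"})],
    ["Title", "SuDocs number"],
)
```

The rows for each sheet could be read with csv.DictReader after saving the spreadsheets as delimited text.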
Posters have really interesting characteristics in terms of metadata. First, posters don’t have an obvious title as books do. Often catalogers have to decide by themselves which information on the poster should be included in its title, and different catalogers may give the same poster different titles. So many posters have alternative titles, and many of them have notes that give more information on the poster. Second, finding a government poster’s creator is problematic. Sometimes, if the poster is a reproduction of an art work, the artist’s name will appear on the poster and can be defined as the creator. But most of the posters are not art works; they were created by designers who worked for the government. These designers’ names don’t appear on the posters, and catalogers often define the creators as the government departments that published these posters, which are not the actual creators.
Last Thursday, I had a chance to talk briefly with the UI library cataloger Sue, who is in charge of cataloging the federal government posters that I am working on. I will try to meet with her to discuss this project further. Sue has made one mystery clear to me: why many posters have a SuDocs number written in pencil on the back, but the OCLC records do not show their SuDocs numbers. The reason is that when these posters are released and deposited in different federal depository libraries, sometimes there is no time or energy to catalog them in the electronic system, and they are merely assigned SuDocs numbers in pencil on their backs. Sue has also offered to help look up the posters that have neither InfoHawk nor OCLC records.
I will continue to clean up the metadata records to format them for ContentDM uploading. After that, I will start to scan part of them. I am so glad that I have passed the preproductive phase of the project!
We have been waiting to work on the top-down scanner since this semester began. Finally it seemed that we were able to start. On Thursday morning, Mark, Kristin, and Keo trained Amber and me to use the scanner. We were very excited and planned to scan some of the remaining medieval manuscripts on Friday. However, when we started to scan a leaf of the manuscript on Friday afternoon, the result was an over-exposed image. The more we tried, the worse it became. Bill, Kristin, and Mark all came to help solve the problem, but no one could figure it out. So, once again, we have to wait. Later on, we heard from Bill that the scanner was back to normal after it was turned off for 15 minutes. So, hopefully, it was just tired and needed some rest.
Since the beginning of this semester, while waiting for the top-down scanner, Amber and I have been working on the medieval manuscript that Amber had digitized during the summer.
After editing over 600 images of the manuscript, Amber experimented with all kinds of file naming strategies. Mark introduced to us a batch file renaming strategy. Windows Vista provides the option, “copy file path.” We followed Mark’s instruction and figured out how to generate the renaming commands in Excel. But eventually we still had to ask Mark to sit down with us when we were using the method, which is basically running a DOS file within the folder of the files we want to rename. It turned out very well!
Before uploading images onto ContentDM, Mark had a concern about the total size of the images, so we compressed all the images. The “multiple file process” option in Photoshop made the task very easy.
Finally, we were ready to upload the manuscript to ContentDM! This was the first time I saw how ContentDM works to manage part of a collection. Amber showed me step by step. She was even able to show me how to work on the metadata after uploading all the images. I am very excited to see that the manuscript is now online! It is called “Missale Romanum.”
For most of the last two weeks, I have been working on getting OCLC records for the posters held by the UI library that don’t have InfoHawk records. If you stop by the Government Publications room, you will see me sitting in front of a big table covered with colorful posters, checking posters’ SuDocs numbers, and working on a laptop to make check-marks in an Excel file, search on WorldCat, and record OCLC numbers.
I can’t help mentioning the change in the Government Publications Department. I remember that years ago, when I came to the department to look for some UN publications, it was a separate room behind a closed door. A reference librarian greeted me and helped me locate my material. Now, the department door has disappeared. It is an open-shelf, integrated part of the library. Two computers and several big study tables are there for students to use, a telephone is set up on a small table for users to call if they have questions, and there is no reference librarian in person at all!
Users can search OCLC (Online Computer Library Center) for all the holding libraries of the material they are interested in. In addition, OCLC makes it possible for libraries globally to collaborate on cataloging efforts. For new items that already have records in OCLC, each participating library simply imports the OCLC records into its local cataloging system and does not need to do the cataloging. What a big relief from the repetitive localized cataloging work! But sometimes there are lags between the global system and the local systems. For example, several hundred of the roughly one thousand government posters that are actually owned by the UI, and that do have OCLC records, do not have records in the InfoHawk system. That’s why I need to find the records for these locally invisible posters.
Good news: Mark and Kristin have already learned to use the top-down scanner. Kristin is going to give Amber and me an introduction to the scanner next week. Then we will be able to digitize the remaining medieval manuscripts.
Through “worldcat.org,” the general public can search OCLC records and find out which libraries have the material they want. Our university subscribes to a more comprehensive WorldCat database that has a FirstSearch interface. In addition to these two versions, Wendy showed me that there is actually another interface to which only catalogers have access. From InfoHawk, she obtained OCLC numbers for posters in our collection. Then Wendy used these OCLC numbers to extract the posters’ records from OCLC and put them into an XML file. The LIT person helped us convert the XML file into a delimited file that is readable by Excel. This information will be used to create metadata for the poster collection.
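I did not see how the XML file was actually converted, but the idea can be sketched in a few lines of Python. This example assumes the export is MARCXML (the Library of Congress XML form of MARC records), which may not match the real file. It pulls out field 001 (the control number, which holds the OCLC number in OCLC records), 245 subfield a (title), and 086 subfield a (government document classification number), and writes them as tab-delimited text. The function names and column headings are my own.

```python
import csv
import io
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace
NS = {"m": MARC_NS}


def subfield(record, tag, code):
    """Return the first matching subfield value of a record, or ''."""
    element = record.find(
        f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']", NS)
    if element is None:
        return ""
    return (element.text or "").strip()


def marcxml_to_tsv(xml_text):
    """Convert MARCXML records to tab-delimited text readable by Excel."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["OCLC number", "Title", "SuDocs number"])
    # iter() also matches the root, so a single-record file works too.
    for record in root.iter(f"{{{MARC_NS}}}record"):
        control_number = record.findtext(
            "m:controlfield[@tag='001']", default="", namespaces=NS).strip()
        writer.writerow([control_number,
                         subfield(record, "245", "a"),
                         subfield(record, "086", "a")])
    return out.getvalue()
```

Saving the returned text to a .txt file gives a delimited file that Excel can open, with one row per record.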
The next several steps I will follow:
- Find posters that do not have records in InfoHawk, look up their information in OCLC, and give Wendy their OCLC numbers so that she can extract the records.
- Categorize posters into three groups according to their dimensions. Smaller ones will be scanned by a regular scanner, middle-sized ones will be scanned using the top-down scanner, and bigger ones will be sent out to a roll-over scanner.
- Meet with an expert to prepare a metadata schema.
This week, I helped Amber edit the digital photographs that she took of one of the medieval manuscripts. First, we performed some simple editing: rotate, crop and straighten. After this, we will adjust the colors of the images. In order to do that, Keo and Amber have calibrated three monitors so that the color presented by these three monitors can be more accurate. So, one lesson I learned is that you need to calibrate your monitor in order to get accurate color presentation.
Another interesting lesson I learned this week is about how to rename files in a batch. Wendy showed us that there is software that you can use to get a batch of files’ information (including names and paths) and export them to a spreadsheet, where you can generate compound file names easily. Then you can generate a *.bat file and save it to the folder that contains the files you want to rename. By a single click on the bat file, you will rename all the files you want to rename! This strategy works well when you have a few folders that contain many files, but not so well when there are many folders with a few files in each.
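The spreadsheet step can also be scripted. The following Python sketch generates the lines of such a bat file from a list of file names. The naming pattern (a prefix plus a zero-padded running number) and the function name are just examples I made up, not the convention we settled on. It relies on the Windows ren command, which takes the old name and the new name.

```python
import os


def rename_lines(old_names, prefix, start=1, digits=4):
    """Build the lines of a Windows .bat file that renames files in a batch.

    Files are numbered in alphabetical order; each keeps its extension.
    The prefix and numbering scheme are hypothetical examples.
    """
    lines = ["@echo off"]
    for number, old in enumerate(sorted(old_names), start=start):
        extension = os.path.splitext(old)[1]
        new = f"{prefix}_{number:0{digits}d}{extension}"
        # Quote both names so that file names with spaces still work.
        lines.append(f'ren "{old}" "{new}"')
    return lines


# Example with two made-up file names.
for line in rename_lines(["IMG 0002.jpg", "IMG 0001.jpg"], "manuscript"):
    print(line)
```

Writing these lines to a file such as rename.bat inside the image folder and double-clicking it would rename the files; testing on a copy of the folder first seems wise.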
Naming files correctly seems to be complex. I will think about it and discuss it with Amber next week.
The government posters that I am working on are cataloged under the SuDocs system. The first drawer contains posters with SuDocs numbers starting with “A,” which stands for the Agriculture Department and its subordinate bureaus and offices. For example:
- A 1. : Agriculture Department (including Secretary’s Office)
- A 13. : Forest Service
- A 21. : Information Office
While waiting for the LTC to give us the spreadsheet of InfoHawk records, I exported the search results for posters beginning with “A,” and tried to match the results with posters. This time, I looked at them more thoroughly. I opened the drawer, removed the posters one by one while matching them with the records, and then put them back in the drawer. The posters have a variety of sizes: small, medium, and large. I recorded those that did not have a record in InfoHawk, and next, I will look for their records in other catalogs.
I met with Mark and Marianne, who is in the government publications department, earlier this week. I can start the GPC (government poster collection) project now!
- We have electronic catalog information for those posters dating from 1976. (searchable through InfoHawk, OCLC, and government publication catalog)
- For those posters published before 1976, we have a printed catalog. Some of the old posters might have electronic records.
- There might be some posters without any records.
Matching records with posters
- Wendy conducted a command search in InfoHawk and she found over 900 records.
- I printed the first 60 records from Wendy’s original list to do a preliminary match with the posters. I found almost all the recorded posters. It seems that InfoHawk contains most of the poster records. Some of the other poster records can be found through OCLC.
- Some important search commands Wendy used to search InfoHawk:
- “wcl = postr and wow = gov” gives us records that came from the Marcive program. Records that were imported individually are not included.
- “wcl=postr and wgv=f” would be ideal, but it returns over 1000 records, which can’t be sorted.
- “wcl=postr and wlc=a –” imprecisely limits results to the SuDoc letter starting with A.
- Mark told me that the LTC will give us a spreadsheet of all the poster records in InfoHawk. I will be able to use that spreadsheet to keep track of the poster scanning process and prepare metadata for the collection.
Government Partnership: there is a possibility that we can have the GPO’s partnership on this project
- “Content Partnership assist with providing permanent public access to electronic U.S. Government information. Partners agree to provide storage capacity and user access without restrictions on re-dissemination. In the event the partner is no longer able to provide free, public access to this electronic information, the partnership requires the agency or library to transfer a copy of the content to GPO. GPO will then make the content available either through GPO Access or in cooperation with another partner.” http://www.fdlp.gov/partnerships/about.html
- Registry of U.S. Government Publication Digitization Projects: a list of digital government publications. We might be one among them.
Now I am waiting for the spreadsheet of InfoHawk records. Mark and Marianne are still working on the scanning issue.
Best Practices & Publications provides information on Digital Imaging and Dublin Core Metadata.
I looked through the collections listed in the “Registry of U.S. Government Publication Digitization Projects,” which “contains records for projects that include digitized copies of publications originating from the U.S. Government.” I found two related collections:
- Illinois Digital Archives contains a collection of World War II Posters. The collection is managed by ContentDM.
- World War Poster Collection in University of North Texas Libraries: I haven’t figured out which management software they used. It’s an ongoing project. Here is a link to UNT library equipment for digital projects.
Before meeting with Marianne from the Government Documents Department, I prepared by looking for similar government poster collections. Mark suggested that I look at some poster collections managed by ContentDM:
There is a war poster collection from University of Washington Libraries:
It’s going to be a nice example for me, I think, because it uses CONTENTdm; however, a digital camera was used to digitize/scan the posters. (Olympus C-2000 Zoom)
I have also searched for other existing government poster collections, and found out that there are two big ones, both of which are World War II poster collections. I looked at these websites and found their “Technical Information” sections helpful. The earlier collection uses a film camera to digitize posters, and the other uses a digital camera to do it. We, on the other hand, are going to use a scanner. It will be interesting to see the differences.
World War I and II Posters and Postcards
(uses Power Phase FX digital camera system to scan)
Another issue about this Govt. Posters Collection is that US government publications have their own cataloging system, so I also searched for that and found some references:
- There are 18 items right now. To make these old treasures accessible online by modern technology is really exciting just by thinking of it. I was wondering whether the users could easily download and print them. Mark showed me that the users could download only low resolution images, but they could view high resolution images online, which I guess is good enough for private research purposes. I am also concerned about my lack of knowledge about how to handle these fragile materials, but it seems that I could always learn from Amber, and Amber can learn from Kristin in the Conservation Department.
- Mark also demonstrated CONTENTdm to me, which is a piece of digital collection management software. He showed me how to create a simple or compound project, create and enter metadata, add a band, upload a project, and approve the upload. It sounds overwhelming, but I hope the real project practice later will make my learning easier. Plus, most of my work hours overlap with Amber’s! I hope that I am not a big burden to her.
- We are still waiting for the new scanner to come some time next week, so that we can use it to scan large flat items. Fortunately, Amber already has some scanned images that need to be edited and uploaded, and she can show me how to do these things early next week. I am looking forward to it!
In my first semester as a digital library fellow, I am going to work at DLS on the following two projects, mentored by Mark Anderson:
1. Help Amber to finish the Medieval Manuscript Collection;
2. Prepare to work on the Government Posters Collection.
The following is quoted from Nikki Saylor’s description of the collection in the email she sent to me, which sounds really interesting.
- “University of Iowa Libraries has been a federal depository library since 1884 and a regional depository in partnership with the Government Printing Office (GPO) since 1962. In that time UI Libraries has accumulated a collection of government-issue posters that promote government services, programs, and initiatives or have been used as social marketing tools. Posters issued by various federal agencies provide information on topics ranging from AIDs awareness and civil rights to national parks and the Viet Nam war. “
- “These posters often represent a graphic documentation of priorities of a given presidential administration or reflect social culture at a discrete point in time. Nearly all federal agencies, both past and present, have produced posters including the Works Project Administration, War Mobilization Office, EPA, Dept. of Interior and NASA. This visual collection has the potential to complement academic course work in public policy, history, communication studies, and health sciences and to enhance outreach activities to primary and secondary (K-12) students. “
- “Because of the physical attributes of posters, they are highly vulnerable to damage through mishandling and inadequate storage. This underused collection, a subset of the SuDocs Hidden Collection, lacks consistent online bibliographic description, has not been promoted as a resource due to the practical difficulty of handling large flat paper sheets and has consequently been designated as non-circulating. There are no collections in the State of Iowa that approach the size of the UI Libraries’ poster collection and no poster collection has been digitized from any of the 52 regional libraries nationwide. “