Getting the data ready

At the end of the second year of my research, I somehow managed to complete my collection of high school history textbooks published since 1949. It was an incredible amount of work. To give you an idea of how I got my data sample ready, here are the steps I took:

1: Making an overview of all textbooks

I visited the People’s Education Press (PEP) in Beijing in July 2017, and was allowed to copy a few pages per book. I decided to copy all the front matter. Combining these copies with a catalogue that they gave me, I made an overview, with pictures, of all 100+ books published by the PEP.

2: Finding the books online

With all the information I needed in hand, I started collecting the books on an online platform for second-hand books.

This platform hosts bookstores, but also individuals who want to get rid of their old books. As a result, completing the whole series involved a lot of texting:

  • Seller: My friend, I apologize, I haven’t seen this book in a long time. Are you in a hurry? If not, then I’ll go to my old place tomorrow to find it.
  • Me: Not in a hurry. Do you remember what it looks like? Because I already have another version, I can send a picture to you.
  • Seller: Please send it. I don’t remember what it looks like, so finding it will take some effort.
  • Me: [photo]
  • Seller: Ah, that book! I remembered it as soon as I saw it. That’s the book from my brother’s year!

3: Receiving the books

Receiving the books also sounds a little easier than it is. Depending on your address, the books will either be delivered to your house, or to a ‘kuaidi’ truck on a street corner. Either way, it usually involves a lot of text messages with codes, people calling you for directions to your house, and quests to find the right truck. When I wasn’t in Beijing, I got lots of help from several friends who received the books for me (thanks!).

Almost all envelopes I received looked similar to this one. It was really fun to open tons of these on a daily basis.

To give you an idea of how many books I’m talking about: this was the result of my three-month trip to China in 2018. I guess it is around one third of the total number of books that I bought over time.

4: Scanning

After checking the quality of the textbooks (and, if needed, repeating steps 2 and 3), I brought most of the books to copy shops in Beijing. I always felt guilty for showing up with bags full of books. After some refusals, I found a woman in an excellent copy shop who got me good-quality scans in no time, and somehow always seemed happy to see me.

5: Cleaning up

Some of the books I got were in perfect condition, but most needed at least some cleaning. Like this one:

After manually erasing all marks with ABBYY FineReader, the left page looks like this:

The most labor-intensive books turned out to be the very first series of history textbooks from 1949. Someone came up with the idea to sideline all keywords (such as names of people, dynasties and places), which made the words unreadable to my OCR software. This meant I had to manually remove all of these lines, which cost me about 8 to 10 hours per book!

6: Final check

ABBYY automatically detects pictures and text in the scanned files. Depending on the complexity of the layout, this detection sometimes needed some fine-tuning. For example, below you can see that the caption of the picture is treated as part of the paragraph underneath it, and that some sentences are split between different text blocks.

7: OCR

After manually adjusting the layout of the page, the software automatically detects all the characters. It highlights characters it isn’t sure of in blue, but as you can see, the percentage of uncertain characters is very low and most of them are correct anyway.

I saved the files as PDF (to preserve layout) and TXT (to use in software for text analysis).
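
As a minimal sketch of what using the TXT files for text analysis could look like: the Python below reads every file in a folder and counts its non-whitespace characters. It assumes the exports are UTF-8 encoded and sit together in one folder. The folder name `textbooks_txt` and the function names are hypothetical and not part of the workflow described here.

```python
from collections import Counter
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in `folder` into a dict keyed by file name (without extension)."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the OCR export is UTF-8; change the encoding if yours differs.
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus


def char_counts(text):
    """Count how often each character occurs, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    corpus = load_corpus("textbooks_txt")  # hypothetical folder name
    for name, text in corpus.items():
        # Print the file name, its length in characters, and its five most common characters.
        print(name, len(text), char_counts(text).most_common(5))
```

Counting characters is a simple starting point for Chinese text, which has no spaces between words. Anything at word level would first need a separate word segmentation step.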


And that’s that, for over a hundred different books and 14,000+ pages. I don’t even want to think about how much time and effort it cost me. You can imagine how happy I am to be done. But now I have a database that will hopefully last a lifetime, and a complete set of textbooks in my bookcase.