NLP AND ME! - E2: COLLECTING TEXT DATA


One of the most important things in NLP is text data. Collecting text data is not a simple task, especially for a minority language like Mizo. This time I'd like to share some simple tactics that I used for collecting data for Natural Language Processing research last year, i.e. the 2015-2016 academic session.

Clean Mizo text data is not readily available. Since I was responsible for collecting a huge amount of it, I had to visit the offices of local newspapers like Vanglaini. We got a big file (maybe larger than 3 GB), but when we tried to work with it, it turned out to be mostly useless for our purposes. So I had to build a clean dataset myself. My plan was to download every page of their website and extract clean text from it.

I am a web developer! I know how websites work, how files like web pages are stored on the server, and the patterns by which pages are displayed.

On some websites, you may have noticed the URL of the page ending in ?id=1234, ?page=23, ?userid=1256, etc. These are query strings with which you can request a particular page.

For example:
If you visit www.angelvestgroup.com/info.php?id=1, you will get one page. Now, if you change the id to 2, i.e. www.angelvestgroup.com/info.php?id=2, you will get a different page. You can go on like that.
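To make this concrete, here is a minimal sketch of walking such a query parameter with a shell loop. The ID range 1-10 and the output filenames are illustrative assumptions, not part of the original workflow:

#!/bin/bash
# Sketch only: fetch pages by stepping through the id query parameter.
for id in $(seq 1 10)
do
    curl -s "http://www.angelvestgroup.com/info.php?id=$id" > "page_$id.html"
done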

When data is entered into a database, each entry is given an ID or a name so that the particular record can be retrieved and displayed in the web browser. But I am not saying this is the only way!
If you are a Facebook user, you may have seen something like profile.php?id=123456789! This is the profile ID of the user. By going to www.facebook.com/profile.php?id=XXXX, you can see whose profile it is.

Most news websites and blogs are implemented this way.

Apparently, the Vanglaini website uses the Laravel PHP framework. If you look at their site, you will see a URL pattern similar to the technique mentioned above.

They have six (6) directories, viz. tualchhung, hmarchhak, ramchhung, khawvel, thalai and infiamna.

Every page on the website has an ID, and a page can be retrieved and displayed simply by the format:
        www.vanglaini.org/any_of_the_above_mentioned_directory/PAGEID

                  e.g. www.vanglaini.org/tualchhung/23456

The website has a nice MVC (model-view-controller) routing setup, and usefully for us, the directory part of the URL does not actually matter: "www.vanglaini.org/tualchhung/12345" will display the same web page as "www.vanglaini.org/thalai/12345" or "www.vanglaini.org/any_directory_name/12345".
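You can check this equivalence yourself. Below is a small sketch (my own check, not part of the original workflow) that fetches the same page ID through two different directories and compares checksums; it assumes the ID 12345 actually exists:

#!/bin/bash
# Compare the same article ID served under two different directory names.
a=$(curl -s "http://www.vanglaini.org/tualchhung/12345" | md5sum)
b=$(curl -s "http://www.vanglaini.org/thalai/12345" | md5sum)
if [ "$a" = "$b" ]; then
    echo "Same page: the directory name is ignored."
else
    echo "Different pages."
fi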

Since I recognized all these patterns, I could simply use the wget command on my Linux system to download all the pages that I required.

I simply used the shell script below, which got me all the web pages I required.

#!/bin/bash
# Download every article page by stepping through the page IDs.
for i in $(seq 1 1 61234)
do
   wget "http://vanglaini.org/tualchhung/$i"
done
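In practice you may want to be a bit gentler on the server. A variant like the one below (my own adjustment, not the script I originally ran) saves each page under its ID, limits retries, and pauses between requests:

#!/bin/bash
# A politer variant: name files by ID, limit retries, pause between hits.
for i in $(seq 1 61234)
do
    wget -q -O "$i" --tries=2 --timeout=15 "http://vanglaini.org/tualchhung/$i" || rm -f "$i"
    sleep 1   # be kind to the server
done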

After I had downloaded all the required pages, I needed to turn them into text files. For this, a very simple but powerful program, html2text, was there to fulfil my requirement. The following line of bash did everything for me.

for file in $(find . -type f -not -name "*.*"); do html2text "$file" > "$file.txt"; done

This line converts all the downloaded files into text files (.txt).
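Note that the loop splits find's output on whitespace, which is fine here because wget saved the pages under plain numeric names. If the filenames could contain spaces, a safer sketch would be:

find . -type f -not -name "*.*" -exec sh -c 'html2text "$1" > "$1.txt"' _ {} \;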

Now I need only the .txt files, so I can delete every file that is not a .txt file. I can do this with
          rm !(*.txt)

This bash command worked fine for me (note that the !(...) pattern requires bash's extglob option).
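For completeness, here is how to enable that pattern, plus an alternative that does not depend on extglob at all (the second form assumes GNU find; run it inside the download directory):

shopt -s extglob      # enable extended globbing, needed for !(...)
rm -- !(*.txt)

# Alternative with GNU find, no extglob required:
find . -maxdepth 1 -type f ! -name "*.txt" -delete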

Now the only thing left to do is to merge all the text files into one file, which can be done using the cat command

cat *.txt > final.txt

which merges the contents of all the .txt files into a single file called final.txt.

In this way, I collected ~1 GB of clean Mizo text data.

I tell you, collecting 1 GB of text data is a big task and takes a lot of time.
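Putting the whole thing together, the pipeline can be sketched as a single script. This is an after-the-fact reconstruction of the steps above, assuming html2text is installed and that the page IDs run up to 61234:

#!/bin/bash
# End-to-end sketch: download, convert, clean up, merge.
mkdir -p corpus && cd corpus || exit 1

# 1. Download every article page by ID (the directory name is ignored).
for i in $(seq 1 61234); do
    wget -q "http://vanglaini.org/tualchhung/$i"
done

# 2. Convert each downloaded page (files with no extension) to plain text.
for file in $(find . -type f -not -name "*.*"); do
    html2text "$file" > "$file.txt"
done

# 3. Keep only the .txt files and merge them into one corpus file.
find . -maxdepth 1 -type f ! -name "*.txt" -delete
cat ./*.txt > final.txt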





