30 June 2013

Big Data Analysis

There's a fantastic series made possible by the University of California, Davis Genome Center, and David Coil on how to access, parse, and analyze big data sequences.

How to display content, and basic grep commands

In summary:

less fileNamePath

displays the file one screen at a time

grep item fileNamePath

searches the file and prints every line that contains item

grep -c item fileNamePath

returns the number of lines in the file that contain item
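
As a rough sketch of the commands above, assuming a small hypothetical file named sample.fasta (the file name and the sequences are made up for illustration and do not come from the series):

```shell
# Hypothetical sample data: three made-up sequence records.
cat > sample.fasta <<'EOF'
>seq1 Escherichia coli
ATGGCGTACGTT
>seq2 Bacillus subtilis
ATGCCGTAAGCT
>seq3 Escherichia coli
TTGGCGTACGAA
EOF

# To page through the file interactively you would run: less sample.fasta

# Print every line that contains the search term.
matches=$(grep 'Escherichia' sample.fasta)
echo "$matches"

# Count the lines that contain the search term.
count=$(grep -c 'Escherichia' sample.fasta)
echo "$count"
```

Note that grep -c counts matching lines, so a line that contains the term twice is still counted once.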

How to parse data with grep

In summary:

grep item originalFilePath > newFilePath

creates a new file containing only the lines that match item

grep -v item originalFilePath > newFilePath

creates a new file containing every line except those that match item
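
A minimal sketch of both redirections, assuming a hypothetical headers.txt with one made-up record header per line (all file names here are placeholders):

```shell
# Hypothetical input: one made-up record header per line.
printf '%s\n' \
  '>seq1 Escherichia coli' \
  '>seq2 Bacillus subtilis' \
  '>seq3 Escherichia coli' > headers.txt

# Keep only the lines that contain the term.
grep 'Escherichia' headers.txt > ecoli.txt

# Keep every line that does NOT contain the term.
grep -v 'Escherichia' headers.txt > other.txt

# grep -c '' counts all lines, because the empty pattern matches every line.
kept=$(grep -c '' ecoli.txt)
dropped=$(grep -c '' other.txt)
echo "$kept matching, $dropped non-matching"
```

Keep in mind that > overwrites the target file if it already exists, so it is worth choosing a new file name for each filtered copy.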

How to analyse data

In summary:

find the occurrences of the item of interest
find the total number of occurrences
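
One way these two counts might be obtained, sketched with a hypothetical reads.txt and an arbitrary example motif (GATC). The -o flag used here is not part of POSIX grep, but GNU and BSD grep both support it:

```shell
# Hypothetical reads, one per line; the motif GATC is an arbitrary example.
printf '%s\n' 'GATCGATC' 'TTTTAAAA' 'CCGATCGG' > reads.txt

# Lines that contain the motif at least once.
lines_with_motif=$(grep -c 'GATC' reads.txt)

# Total occurrences: -o prints each match on its own line, so count those lines.
total_hits=$(grep -o 'GATC' reads.txt | grep -c '')

# Total number of lines in the file, for reporting a proportion.
total_lines=$(grep -c '' reads.txt)

echo "$lines_with_motif of $total_lines lines contain GATC ($total_hits occurrences in total)"
```

The difference between the first two numbers matters whenever a single line can hold the item more than once.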

Please support this awesome series and subscribe to David's channel on YouTube, where you can also view the series, and more, in its entirety.

13 June 2013

NSScanner and NSString

Last week, I wrote about the usefulness of the HPPLE library for parsing the web. There's no doubt it's a great and powerful instrument for parsing XML. However, circumstances prompted me to dig a little deeper and investigate whether there was a simpler or more reliable means of parsing data.

After having seen how NSString methods can effectively displace the need for regular expressions, I did a bit of searching and came across the NSScanner class. A scanner is created from the input string. Its scanUpToString:intoString: method then advances through the input until it reaches the string passed as the parameter, and saves the characters it passed over into a new string, if one is supplied.

http://stackoverflow.com/questions/6825834/objective-c-how-to-extract-part-of-a-string-e-g-start-with

Essentially, instead of traversing a path, NSScanner looks for a sequence of characters (given as an NSString object) and stops at that site, so the document can be split there. This works well for parsing CSS, something that was not possible with HPPLE, in addition to other sources including XML.

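NSString *inputString = @"this is your input, can be CSS or XML";
NSString *startTag = @"enter first bound word or phrase to search up to";
NSString *endTag = @"enter last bound word or phrase to search up to";

NSString *savedString = nil;

NSScanner *scanner = [[NSScanner alloc] initWithString:inputString];
// Advance to the start tag, discarding everything before it.
[scanner scanUpToString:startTag intoString:nil];
// Step over the start tag itself. If the tag was never found this returns NO
// and savedString stays nil, instead of moving scanLocation past the end.
if ([scanner scanString:startTag intoString:nil]) {
    // Capture everything up to, but not including, the end tag.
    [scanner scanUpToString:endTag intoString:&savedString];
}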

This was then abstracted into a method, as below:

+ (NSString *)scanString:(NSString *)string startTag:(NSString *)startTag endTag:(NSString *)endTag
{
    // Returned as-is if the input is empty or the start tag is missing.
    NSString *result = @"";

    if (string.length > 0) {
        NSScanner *scanner = [[NSScanner alloc] initWithString:string];

        // Skip everything before the start tag, then step over the tag itself.
        [scanner scanUpToString:startTag intoString:nil];
        if ([scanner scanString:startTag intoString:nil]) {
            // Keep the text between the two tags.
            [scanner scanUpToString:endTag intoString:&result];
        }
    }

    return result;
}

Parsing can then be carried out with a single call to this method, or a sequence of calls, to obtain the desired final string. Thank you to Natasha Murashev for helping to make this possible (http://natashatherobot.com/, https://twitter.com/NatashaTheRobot).

Thank you :)

The library for the native parser is available on GitHub at https://github.com/aug2uag/NativeParserLibrary

06 June 2013

HPPLE and accessing paths in XPATH

WHY?? WHY OH WHY???

Ray Wenderlich's tutorial on utilizing HPPLE is freaking amazing! I highly recommend using these classes for parsing HTML and other XML. Using HPPLE felt very similar to parsing JSON, although a wee bit more uhhh... challenging. Unfortunately, I was not able to find a good XPATH reader, whereas many JSON viewers are available online for free. Nonetheless, a little bit of experimentation and testing went a long way toward finding paths to the desired locations.

Here, I'll mention the three techniques I used most while parsing HTML with HPPLE:

1- Test, experiment, repeat

To access a path, there are numerous challenges. First and foremost, make sure you are using the correct path and that it is spelled correctly. The Google Chrome Developer Tools view is very helpful for viewing the path. Make sure your quotes are aligned and balanced. Also be sure that you are parsing the correct URL. Check, and double check, everything regarding your XPATH NSString path and URL.

For instance, in Google Chrome Developer Tools you can highlight most sections of a webpage and choose "Inspect Element". The path is then shown at the bottom of the developer window, and you must enter that information into your NSString. For example, you might see:

div#sectionOne, div#sectionTwo, a, p, h2

That would be translated to an NSString as follows (note the @ that XPATH requires before each attribute name):

@"//div[@id='sectionOne']/div[@id='sectionTwo']/a/p/h2";

If your breakpoint or NSLog shows a nil/null object, don't despair! Check your path again, check your quotes, and make sure you are indicating the correct tag and the correct attribute (id, class, etc.). If you still have trouble, shorten your path. For instance, in the above, you would begin by removing the h2, and if that still doesn't work, keep removing components until you get an array that you can use.

2- Access arrays, and access dictionaries

Sometimes, the array returned for your XPATH is freakish looking: a monster string that is way too freakish to inspect as an array. In such cases, utilize fast enumeration! You can print out your array or dictionary via:

for (TFHppleElement *element in yourArray) {
 NSLog(@"\n%@\n\n", element);
}

You will often notice that your array is a combination of strings, dictionaries, and/or arrays. If you run into trouble with your data structure, there are a couple of approaches that may help. One is to create a custom method in your TFHppleElement header and implementation files to access the node(s) you frequently encounter in your XPATH. Another is to identify whether you are dealing with a string, array, or dictionary, and to use the proper instrument to access the paths as necessary. Common ones will be the 'content' method, array brackets, and valueForPath:@"string". You can really benefit from breaking your large data structure down one piece at a time, instead of being stuck trying to narrow it down in one stroke.

3- Regular expressions and string methods

Despite the awesomeness of HPPLE, you will sometimes reach a dead end where you are no longer working with XML, and are left with exorbitant whitespace and/or nasty tags. You can benefit from utilizing NSString methods and/or regex to take care of such mishigas. Deal with what you have, and you will find at least some form of happiness with HPPLE, as many before have done for what may be considered a long or very long time in the world of iOS programming.

Alright!!

I hope this gives you a little bit of motivation in your pursuit of parsing XML, including HTML. If you are in the middle of a project currently, hang in there! There are great alternatives to HPPLE, but honestly your choice will probably not affect the growing pains that much, and HPPLE is a great and powerful instrument!

Sometimes, though, the XML is embedded and not accessible. You can parse all of the accessible tags by assigning /html as the path, and then do a basic find to see whether or not your element is accessible. If not, look for a different URL, or you may have to turn to another instrument and/or approach for your immediate needs. For more, check out the example project I made to parse data from Toys"R"Us on GitHub.

aug2uag
