
Sunday, August 16, 2015

Joomla component: Article generator from RSS feeds - part 4

The friend who asked me for this Joomla component gave me some feedback:

  1. you should be able to split the article into an Intro and a Fulltext, as some RSS feeds return large articles
  2. the photos should be saved locally (it's good for SEO)
  3. some RSS feeds contain unwanted links, so maybe do something about that as well


1. Something I learned about Joomla articles while researching this task is that the intro and the so-called full text are stored in two different columns of the database. The Fulltext column is not supposed to repeat the intro part; it holds only the continuation. Depending on your settings, clicking "Read more" takes you to a page that shows either only the continuation or the Introtext and Fulltext concatenated. The setting is found under:

System -> Global Configuration -> Articles -> Show Intro Text (set to Show or Hide)
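To make the two-column idea concrete, here is a minimal sketch in plain PHP that runs outside Joomla. The function names `splitArticle` and `renderArticlePage` are hypothetical, made up for this illustration. The sketch assumes the `<hr id="system-readmore" />` marker that the Joomla editor inserts for "Read more", and it only approximates how the "Show Intro Text" setting affects the article page.

```php
<?php
// Hypothetical helper: split article HTML at Joomla's "Read more" marker.
// Returns array(introtext, fulltext); fulltext is '' when there is no marker.
function splitArticle($html)
{
    $pattern = '#<hr\s+id=["\']system-readmore["\']\s*/?>#i';
    $parts = preg_split($pattern, $html, 2);
    $intro = trim($parts[0]);
    $full = isset($parts[1]) ? trim($parts[1]) : '';
    return array($intro, $full);
}

// Hypothetical helper: approximate the effect of "Show Intro Text"
// on the page you reach after clicking "Read more".
function renderArticlePage($introtext, $fulltext, $showIntro)
{
    if ($showIntro) {
        // Show: intro and continuation are concatenated
        return $introtext . ' ' . $fulltext;
    }
    // Hide: only the continuation, unless the article has no full text
    return $fulltext !== '' ? $fulltext : $introtext;
}
```

Because the two parts live in separate columns, whatever goes into the Fulltext column must start where the intro stops, otherwise the intro shows up twice when the setting is "Show".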

I added the logic to the RSSReader.php file found under /models. There is a new function called "htmlParser" which does the magic by calling some other functions:

    /**
     * Identifies images in the HTML, downloads them locally and replaces links to external images
     * with links to the local copies. Also splits the article into Intro and Fulltext.
     *
     * @param string  $htmlInput     - string containing the HTML to be parsed
     * @param integer $allow_links   - if 0 replace link targets with '#', if 1 keep links as they are
     * @param integer $split_after_x - position after which the article is split into Intro and Fulltext
     * @return array $article_array, $article_array[0] contains introtext, $article_array[1] contains fulltext
     */
    public function htmlParser($htmlInput,$allow_links,$split_after_x)
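The image part of what the docblock describes could look roughly like the sketch below. It is not the component's actual code: `localizeImages` is a hypothetical name, and the downloader is passed in as a callable (the role that getImage plays in the component) so the example can run without network access.

```php
<?php
// Hypothetical sketch: rewrite every <img src> to whatever $download returns.
// $download receives the remote URL and returns a local link, or false on failure.
function localizeImages($html, $download)
{
    $dom = new DOMDocument();
    // Wrap the fragment in a <div> so it has a single root element; the flags
    // stop libxml from adding a doctype and <html>/<body> wrappers.
    // The @ hides warnings about sloppy feed markup.
    @$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $local = call_user_func($download, $img->getAttribute('src'));
        if ($local !== false) {
            // Download worked: point the tag at the local copy
            $img->setAttribute('src', $local);
        }
    }

    // Serialize only the children of the wrapper <div>
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

If a download fails, the sketch leaves the original external link in place. Note that loadHTML assumes ISO-8859-1 input by default, so UTF-8 feeds would need their encoding declared before parsing.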


I searched the internet for a function that can split text containing HTML without breaking any tags. I found one from the CakePHP framework's text helpers via this blog: https://dodona.wordpress.com/2009/04/05/how-do-i-truncate-an-html-string-without-breaking-the-html-code/
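To show what "without breaking any tags" means in practice, here is a much simplified sketch of the idea. It is not the CakePHP function; `splitHtml` is a hypothetical name. It keeps a stack of open tags, closes them at the end of the first part and reopens them at the start of the second.

```php
<?php
// Hypothetical helper (not the CakePHP code): split HTML after $limit
// characters of visible text. Returns array(intro, fulltext).
// Byte-based for brevity; UTF-8 text would need mb_strlen()/mb_substr().
function splitHtml($html, $limit)
{
    $void = array('br', 'img', 'hr', 'input', 'meta', 'link');
    // Tokenize into tags and text, keeping the tags
    $tokens = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $open = array();   // stack of array(tag name, full opening tag)
    $intro = '';
    $count = 0;        // visible characters copied so far

    foreach ($tokens as $i => $token) {
        if ($token[0] === '<') {
            if (preg_match('#^</\s*(\w+)#', $token, $m)) {
                // Closing tag: pop it if it matches the innermost open tag
                $last = end($open);
                if ($last !== false && $last[0] === strtolower($m[1])) {
                    array_pop($open);
                }
            } elseif (preg_match('#^<(\w+)#', $token, $m)
                && substr($token, -2) !== '/>'
                && !in_array(strtolower($m[1]), $void)) {
                // Opening tag that needs a closing tag later
                $open[] = array(strtolower($m[1]), $token);
            }
            $intro .= $token;
            continue;
        }

        $remaining = $limit - $count;
        if (strlen($token) <= $remaining) {
            $intro .= $token;
            $count += strlen($token);
            continue;
        }

        // The split point falls inside this text token
        $closing = '';
        foreach (array_reverse($open) as $tag) {
            $closing .= '</' . $tag[0] . '>';
        }
        $reopening = '';
        foreach ($open as $tag) {
            $reopening .= $tag[1];
        }
        $rest = implode('', array_slice($tokens, $i + 1));
        return array(
            $intro . substr($token, 0, $remaining) . $closing,
            $reopening . substr($token, $remaining) . $rest,
        );
    }

    // Text is shorter than the limit: everything is intro
    return array($intro, '');
}
```

For example, splitting `<p>Hello <b>brave new</b> world</p>` after 9 characters gives `<p>Hello <b>bra</b></p>` and `<p><b>ve new</b> world</p>`. The sketch cuts in the middle of words and does not treat HTML entities specially.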


2. For saving the photos locally I created a method called getImage, which is called by htmlParser.
I am saving the images at this path:  administrator\components\com_rssaggregator\assets\images

I use pathinfo() to get information about the actual image file from the link. I also add a random string to the name to make sure it does not overwrite another image with the same name.

public function getImage($link)
    {
        //local folder where the images are stored
        $local_path= JPATH_BASE.DS.'administrator'.DS.'components'.DS.'com_rssaggregator'.DS.'assets'.DS.'images'.DS;

        //get information from path using pathinfo
        $path_parts = pathinfo($link);

        //if file has no extension consider default to be .gif
        if(isset($path_parts['extension'])){
            //drop any query string after the extension (e.g. "jpg?w=300" becomes "jpg")
            $extension_parts = explode('?', $path_parts['extension']);
            $extension = '.' . $extension_parts[0];
        }else {
            $extension = '.gif';
        }

        //name of the file without extension
        $base_name = $path_parts['filename'];

        //local file name = source file name + random string + extension, to avoid overwriting an image with the same name
        $local_file_name= $base_name . '_' . $this->generateRandomString(10). $extension;

        $complete_local_path = $local_path . $local_file_name;
        $local_link='/administrator/components/com_rssaggregator/assets/images/'.$local_file_name;

        //copy() returns false when the download fails
        if (copy($link, $complete_local_path)) {
            return $local_link;
        }

        return false;
    }
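The naming step can be tried on its own. The sketch below is a variation on it, not the component's code: `randomSuffix` stands in for generateRandomString (whose body is not shown in this post) and `localFileName` is a hypothetical helper. It assumes you want to strip the query string with parse_url() before calling pathinfo().

```php
<?php
// Hypothetical stand-in for generateRandomString(): random lowercase
// letters and digits. Good enough to avoid name clashes, not for security.
function randomSuffix($length)
{
    $alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        $out .= $alphabet[mt_rand(0, strlen($alphabet) - 1)];
    }
    return $out;
}

// Hypothetical helper: build the local file name for a remote image URL.
function localFileName($link)
{
    // parse_url() drops "?query" and "#fragment" before pathinfo() sees the path
    $path = (string) parse_url($link, PHP_URL_PATH);
    $parts = pathinfo($path);

    // Fall back to .gif when the URL has no extension
    $extension = isset($parts['extension']) ? '.' . $parts['extension'] : '.gif';

    // Keep only characters that are safe in a file name
    $base = isset($parts['filename']) ? $parts['filename'] : '';
    $base = preg_replace('/[^A-Za-z0-9_-]/', '_', $base);

    return $base . '_' . randomSuffix(10) . $extension;
}
```

With this approach a link such as http://example.com/images/photo.jpg?w=300 ends up as photo_ followed by ten random characters and .jpg.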



3. I concluded that it is difficult to say whether the links found in an RSS feed article are good or bad, so I let the Joomla admin decide whether to allow them as they are or to replace the href value with '#'. This makes them point to the top of the current page instead of an external site.

if ($allow_links===0) {

    $DOM2 = new DOMDocument;
    $DOM2->loadHTML($editedHTML);

    //get all anchors and change their href to #
    $items = $DOM2->getElementsByTagName('a');

    foreach($items as $item){
        if($item->hasAttribute('href')) {
            $item->setAttribute('href','#');
        }
    }
    //save changes made to $editedHTML
    $editedHTML=$DOM2->saveHTML();

}
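One thing to be aware of with this approach: when DOMDocument loads an HTML fragment and saves it back, it normally wraps the result in a doctype plus html and body tags. Below is a self-contained sketch of the same idea that returns only the fragment. `neutralizeLinks` is a hypothetical name, and the wrapper div is an assumption of this sketch rather than something the component does.

```php
<?php
// Hypothetical helper: replace every link target in an HTML fragment with '#'.
function neutralizeLinks($html)
{
    $dom = new DOMDocument();
    // Single <div> root plus these flags keep libxml from adding
    // a doctype and <html>/<body> wrappers. @ hides markup warnings.
    @$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Anchors without href (e.g. named anchors) are left alone
        if ($anchor->hasAttribute('href')) {
            $anchor->setAttribute('href', '#');
        }
    }

    // Return only what is inside the wrapper <div>
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

The link text stays in the article, so the reader still sees the words, but clicking them no longer leaves the site.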


Fork me on GitHub!
 
I have added this project to GitHub; you can download the code from here: https://github.com/cristianpana86/com_rssaggregator