Using Q-Reinforced Learning to Generate Digital Logic

Trying to make my job obsolete.

Since I started my digital design career, I have had an itch that something was off. Typically, as a team, we get to a point where the inputs and outputs of a block are well defined (or frozen), the behavior we need is written down nicely in the specification, and verification already has an entirely abstract testbench that can tell whether the block works or not. Even then, weeks are spent running tests, manually finding failures, and fixing a design block whose behavior is already implemented in software, just in another form. We have already told the computer what the block is supposed to do and what information it needs, so why can't it iterate through these tests and make adjustments automatically? Ultimately, the computer can judge whether a fix (or an entire block of logic, for that matter) works far better than we can with the limited context we keep in our human heads. I am thankful that this little corner of inefficiency exists, since it gives me purpose, but at the end of the day it would be neat if a computer could close the logic bug, fix, verify loop without a designer, so the designer could focus on the items with the most marginal gain and get the most out of the 80/20 rule.

Why Q-Table Reinforcement Learning?

When I first learned about machine learning at a high level, the loop of taking an input, testing the output, and modifying the middle to get closer to an answer felt like it fit the problem above well. So fitting this machine learning cycle onto the design cycle could give us a way to shorten some of these long manual cycle times (at least for the more minor stuff, as I still need a job). Granted, my first attempts at applying machine learning to designing Hardware were naive. I tried using Binary Neural Nets to map an input to an output and overtrain them until they gave weights that could tie directly to Hardware. However, getting a BNN to learn, let alone overfit, is a nondeterministic task. On top of that, if you already have the whole set of inputs and outputs, we already have optimization methods and logic generation algorithms to do this, so it was just reinventing the wheel. Then I got into approximate computing, where you can use machine learning to run larger equations faster by giving up some accuracy; however, that area of study still misses the mark of what I am going after.

At this point, I broke down precisely what, as an ASIC design team, we typically have:

  • We have some input, which we know the entire space of, whether it be an interface, group of events or instructions. 

  • We have a specific specification of output behavior based on that input.

  • We can grade the output of a block which takes this input through a verification abstraction. Test benches can often tell how far a test goes before failing, how far off bandwidth requirements we are, or how accurate a particular block is.

Then, when I started to learn about reinforcement learning, it looked like it would fit this cycle well and use the collateral we already have above. We want to make design iterations and grade each one with a testbench that scores the design against a specification. Using this grade, we can then update the design to get closer to the intended function. One issue with the more advanced RL methods is that they use DNNs to transform the input into a result. Using a DNN means this block would have to run an entire Deep Neural Net on the inputs to get the outputs, which is likely not viable: problems at that scale are addressed with accelerators of various sizes and take multiple cycles, while many of the simple tasks we want to start with need to live in the single-cycle or few-cycle domain. Starting out by having a computer generate a complex multi-cycle design is biting off far more than you can chew. But looking at the origins of RL methodologies, I found Q-Table learning, which fits the bill: it is a reinforcement learning method that keeps its knowledge in a table, and tables are everywhere in digital design implementation.

What is Q-Table Reinforcement Learning?

Forgive me if I butcher this some, as there is a lot more math behind why this works: Q-Table learning is a method that trains a table to translate an environment observation into an action. Knowing the state of the environment (which in the case of Hardware could be inputs, state machine states, and counters), the table selects an action (which could be the selection of an arbiter, a multiplexer, or even the next state of a state machine). The computer can therefore learn a table that creates arbitrary combinational logic and state machines to accomplish some high-level goal.

The table has a row for every possible state the environment can be in; effectively, the environment state is the address into the table. Each column is a possible action the table can take, and the value in each cell is the expected reward for selecting that action in that state. To use the table, you look up the action with the best reward for the state the environment is currently in. That action is then taken, which modifies the environment, and the table makes another selection based on the new environment, and so on. This fits Hardware (and state machines in general) very well: there is some current state, and based on the inputs you want to pick the state for the next clock cycle; on that next clock cycle you make another selection.
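
As a toy illustration of the lookup (the numbers here are made up, not from the arbiter later in this post):

# Toy Q-Table: 2 environment states x 3 actions, values are made up.
# Row = current state, column = action, cell = expected reward.
q_table = {
    (0, 0): 0.1, (0, 1): 0.7, (0, 2): -0.2,
    (1, 0): 0.4, (1, 1): 0.0, (1, 2): 0.9,
}
n_actions = 3

def best_action(state):
    # Pick the column with the highest expected reward for this row.
    return max(range(n_actions), key=lambda a: q_table[(state, a)])

print(best_action(0))  # -> action 1
print(best_action(1))  # -> action 2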

Suppose we can get this table to learn how to perform actions that conform to some arbitrary high-level function. Then we have a way to make Hardware from a rule set, rather than from precise behavioral transformations like additions, shifts, and selects. The Q-table also translates into logic easily with our existing methods, unlike a Deep Neural Network, whose evaluation is a task in itself. Effectively, all you need to do is walk the table and map each row to its best action, which is just one big case statement; any logic compiler can reduce that and implement it in gates.
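
As a rough sketch of that translation (assuming a trained Q dictionary keyed by (state, action), as the Agent class later in this post builds; the function and signal names here are illustrative only):

# Sketch: dump a trained Q-Table as one big SystemVerilog case statement.
# Q is assumed to be a dict keyed by (state, action) holding learned values.
def q_table_to_case(Q, n_states, n_actions, state_bits, action_bits):
    lines = ["always_comb begin", "    case (state)"]
    for state in range(n_states):
        # The implemented behavior for a row is simply its argmax column.
        best = max(range(n_actions), key=lambda a: Q[(state, a)])
        lines.append("        %d'd%d: action = %d'd%d;"
                     % (state_bits, state, action_bits, best))
    lines.append("        default: action = '0;")
    lines.append("    endcase")
    lines.append("end")
    return "\n".join(lines)

A synthesis tool then reduces this case statement like any other.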

In the next section, we detail exactly how we train this table so that it knows what action to take based on the environment.

How do we implement Q-Table Learning?

Figure 1. Loop of Q Reinforced Learning

Initialization

Initially, we construct a new Learning Agent with a Learning Rate, an Epsilon Start, an Epsilon End, and an Epsilon Decrement amount.

  • Learning Rate is how much a reward changes the table based on the cost function. The higher it is, the faster the table learns, but the more likely it is to overshoot convergence points. One way to address this is to reduce the learning rate as the number of games progresses, much like the following parameter.

  • Epsilon is the amount of randomness the Agent uses when selecting an action based on the environment. With Q-Table reinforcement learning, you have to start with some randomness to populate the table with approximate values; otherwise, once it finds one way to "win" the game, it will never deviate from that path, because a guaranteed win beats a possible loss. So these parameters set how much randomness we start with, how much we end with, and how quickly the randomness decreases as the games go on to fine-tune the Q-Table.
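
For reference, these knobs feed the standard Q-learning update, which is exactly what the learn() method below implements (gamma, the discount factor on future rewards, is the remaining parameter that appears in the code):

Q(s, a) <- Q(s, a) + lr * ( reward + gamma * max over a' of Q(s', a') - Q(s, a) )

Here s is the current state, a is the chosen action, and s' is the next state the environment returns.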

Main Loop

The main loop consists of many games. The best way to operate this loop is to run a minimum number of rounds and then start saving the best-performing game. The table will gradually learn over all of the games it runs.

In each loop, you run one game: first, reset the environment to some initial state; then, while the environment says the game is not done, perform this loop.

  • Choose an Action based on the observed environment.

  • Tell the environment the action you took, and you get back the next state of the environment, the reward for that action, and whether the game is done.

  • You then tell the Agent the initial environment, the Action, and the reward it got. The Agent will then update its Q-Table row for that environment and the column for that Action based on the reward and learning rate. 

  • Then, set the current environment based on the next state of the environment that was returned.

The code for this initialization and main loop is below. This code involves most of the hyperparameter tuning you will do for the task you want to learn. However, this code generally can be used for any task and isn't application-specific. 

import matplotlib.pyplot as plt
import numpy as np
from q_learning_agent import Agent
from arbiter_enviroment_3x2 import ArbiterEnv3x2

import pickle

if __name__ == '__main__':
    # Initialize environment and agent
    env = ArbiterEnv3x2()
    agent = Agent(lr=0.003, gamma=0.9, 
                  eps_start=1.0, eps_end=0.01, 
                  eps_dec=0.9999999991, 
                  n_actions=env.numberOfActions, 
                  n_states=env.numberOfStates)

    scores = []
    steps = []
    info = []
    win_pct_list = []
    step_list = []
    n_games = 2500000

    prefix = "3x2_no_state_" 

    for i in range(n_games):
        done = False
        observation = env.reset()
        score = 0
        while not done:
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            agent.learn(observation, action, reward, observation_)
            score += reward
            observation = observation_

        scores.append(score)
        steps.append(info["cycle"])
        if i % 10 == 0:
            win_pct = np.mean(scores[-10:])
            stp_mean = np.mean(steps[-10:])
            win_pct_list.append(win_pct)
            step_list.append(stp_mean)
            if i % 100 == 0:
                print('episode ', i,
                      'avg_score %.2f' % win_pct,
                      'epsilon %.2f' % agent.epsilon,
                      'avg_steps %.2f' % stp_mean)

    # Save the learned Q-Table once training finishes (uses the prefix defined above)
    agent.writeToFile(prefix)

Agent

The Agent is also not application-specific, and its code is below. The Agent's job is to be the keeper of the Q-Table: it takes in an environment state and chooses an action based on the Q-Table, and it learns by taking the reward for a specific action and updating the Q-Table.

import numpy as np
import json 

class Agent():
    def __init__(self, lr, gamma, n_actions, n_states, eps_start, eps_end, eps_dec):
        self.lr = lr
        self.gamma = gamma
        self.n_actions = n_actions
        self.n_states = n_states
        self.epsilon = eps_start
        self.eps_min = eps_end
        self.eps_dec = eps_dec

        self.Q = {}

        self.init_Q()

    # Initialize table to all 0
    def init_Q(self):
        for state in range(self.n_states):
            for action in range(self.n_actions):
                self.Q[(state,action)] = 0.0

    # Choose action depending on randomness of Epsilon
    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            action = np.random.choice([i for i in range(self.n_actions)])
        else:
            actions = np.array([self.Q[(state, a)] for a in range(self.n_actions)])
            action = np.argmax(actions)

        return action

    def decrement_epsilon(self):
        self.epsilon = self.epsilon*self.eps_dec if self.epsilon>self.eps_min else self.eps_min

    # Update Table based on a Reward for an Action in a Given State
    def learn(self, state, action, reward, state_):
        actions = np.array([self.Q[(state_,a)] for a in range(self.n_actions)])
        a_max = np.argmax(actions)

        self.Q[(state,action)] += self.lr*(reward + self.gamma*self.Q[(state_,a_max)] - self.Q[(state,action)])

        self.decrement_epsilon()

    # Write Table to File
    def writeToFile(self, prefix):
        dictToWrite = {}
        for state in range(self.n_states):
            stateDict = {}
            for action in range(self.n_actions):
                stateDict[action] = self.Q[(state,action)] 
            dictToWrite[state] = stateDict
        f = open(prefix + "_qTable.txt", "a")
        f.write(json.dumps(dictToWrite, indent = 4))
        f.close()

Environment

The environment is the application-specific part of this learning method, but at a high level an environment is just a set of variables in a particular state. At the beginning of a game, these variables are set to an initial condition, and as actions are performed on the environment, it tells the Agent the reward for each action and updates its state based on the action that was taken.
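
In code, this boils down to a reset/step contract, which is all the Agent and main loop above rely on. A minimal sketch (the class name and its toy state update are made up for illustration; the real arbiter environment appears below):

# Minimal gym-style environment skeleton; the arbiter environment later
# in this post fills in these two methods with real behavior.
class MinimalEnv:
    def reset(self):
        # Put every internal variable back to its initial condition and
        # return the starting state (an integer row index into the Q-Table).
        self.counter = 0
        return 0

    def step(self, action):
        # Apply the action, advance the internal state, and grade the action.
        self.counter += 1
        next_state = self.counter % 4           # toy state update
        reward = 1.0 if action == 0 else -1.0   # toy reward
        done = self.counter >= 10
        info = {}
        return next_state, reward, done, info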

Can we make something useful? 

So I thought a switch would be a good first application: a confined problem where the cost function can be easily defined. For future work, we could also combine multiple switches into a network that follows some system-level rules.

For the first try at the switch, I added no custom logic to it; essentially, any state machine or counter the table needed, it would have to construct itself. This, however, explodes the state space and isn't efficient for iterative experimentation, so I cheated some: in the environment, I provide the table with a counter of how many data beats were sent to a target port in a particular window as an input. This is a simple enough construct that the table still has to learn a lot of logic.

Next, I decided to make the first switch a 3-in x 2-out switch, small enough not to explode the state space again, yet a realistic building block for larger networks, internal arbitration, queue scheduling, and so on. The task is to limit the output bandwidth so that only 20% of the BW of either output port can be used, so it is both a rate limiter and a load balancer. Now that we have our application, we must define the table's inputs and actions.

The Q-Table will take in:

  • Valid from all of the input ports. (I optimized this away, since an invalid port can be represented as a size of 0 beats.)

  • Size from all the input ports (0,1,2 or 4 beats)

  • Route to output ports. (Port 0 or 1)

  • Counter for each output port. (Counts up to 16 beats in a 16-cycle window)

The actions of the Q-Table are:

  • Select Neither

  • Select Input 0

  • Select Input 1

  • Select Input 2

  • Select Input 0,1

  • Select Input 0,2

  • Select Input 1,2
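
Multiplying out the ranges of these inputs is what fixes the size of the table, and it matches the numberOfStates = 131072 used in the environment code below:

# Sanity check of the table size implied by the inputs listed above.
size_values    = 4    # 0, 1, 2 or 4 beats per input port (0 also covers "not valid")
route_values   = 2    # target output port 0 or 1, per input port
counter_values = 16   # beats counted in the 16-cycle window, per output port

n_states  = (size_values ** 3) * (route_values ** 3) * (counter_values ** 2)
n_actions = 7         # the seven selections listed above
print(n_states)       # 131072, i.e. a 17-bit state index into the Q-Table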

Figure 2. Microarchitectural View of the Switch

Environment and Cost Function

At initialization, the environment fills a queue for each of the three input ports with 200 random packets of random size (1, 2, or 4 beats) and random target (0 or 1). The environment considers the game done once all queues are empty.

When an input queue is selected, its head packet is popped, and the target output port stays busy for that packet's length in cycles. If there is a conflict where the Q-Table selects an input whose target port is already active, the selection will not occur. If that happens, there is a significant negative reward, as that isn't an "efficient" selection in my mind: the table should choose Select Neither, or not select an input whose output port it can't use.

The cost function looks at the total transmitted on each port versus the ideal amount for that port. The reward is the negative of the difference between ideal and actual (more of a punishment than a reward), and it becomes -100 outright if there is a route clash.
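
To put made-up numbers on that: at cycle 100 with a 20% target, the ideal total for a port is 20 beats, so a port that has actually sent 15 beats contributes -|20 - 15| / 20 = -0.25. A small sketch mirroring the reward section of the environment code below:

# Worked example of the reward with made-up numbers.
cycle = 100
targets = [0.20, 0.20]      # utilization goal per output port
transmitted = [15, 22]      # beats actually sent so far (hypothetical)
route_clash = False

reward = 0.0
for target, actual in zip(targets, transmitted):
    ideal = cycle * target                 # 20 beats at cycle 100
    reward -= abs(ideal - actual) / ideal  # penalty grows with the miss
reward = -100 if route_clash else reward
print(reward)                              # -(5/20) - (2/20) = -0.35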

Below is the code for the environment, which also contains the reward function. The output buffer code is in the appendix to shorten the listing a bit:

import gym
import math
from collections import deque
import random
from output_buffer_agent import OutputBuffer
import matplotlib.pyplot as plt

class ArbiterEnv3x2(gym.Env):
    def __init__(self):
        self.cycle = 1
        self.inputBuffer0 = deque()
        self.inputBuffer1 = deque()
        self.inputBuffer2 = deque()

        self.numberOfStates = 131072
        self.numberOfActions = 7

        self.outZeroBwArray = []
        self.outOneBwArray = []

        self.targets = [0.20,0.20]
        self.targetArray0  = []
        self.targetArray1 = []

        self.outputBuffers = [
            OutputBuffer(),
            OutputBuffer()
        ]
        
    def step(self, action):
        self.cycle = self.cycle + 1
        done = False
        clash = False
        routeClash = False

        route0 = self.inputBuffer0[0]["route"] if (self.inputBuffer0) else 0
        len0 = self.inputBuffer0[0]["len"] if (self.inputBuffer0) else 0

        route1 = self.inputBuffer1[0]["route"] if (self.inputBuffer1) else 0
        len1 = self.inputBuffer1[0]["len"] if (self.inputBuffer1) else 0

        route2 = self.inputBuffer2[0]["route"] if (self.inputBuffer2) else 0
        len2 = self.inputBuffer2[0]["len"] if (self.inputBuffer2) else 0

        # Execute Action
        # Pick 0 
        if action == 0:
            if (not self.outputBuffers[route0].canSelect()):
                clash = True
            elif( self.inputBuffer0 ):
                self.outputBuffers[route0].select(len0)
                self.inputBuffer0.popleft()

        # Pick 1
        if action == 1:
            if (not self.outputBuffers[route1].canSelect()):
                clash = True
            elif( self.inputBuffer1 ):
                self.outputBuffers[route1].select(len1)
                self.inputBuffer1.popleft()

        # Pick 2
        if action  == 2:
            if (not self.outputBuffers[route2].canSelect()):
                clash = True
            elif( self.inputBuffer2 ):
                self.outputBuffers[route2].select(len2)
                self.inputBuffer2.popleft()

        # Pick 0,1
        if action == 3:
            if (route0 != route1):
                if (not self.outputBuffers[route0].canSelect()):
                    clash = True
                elif( self.inputBuffer0 ):
                    self.outputBuffers[route0].select(len0)
                    self.inputBuffer0.popleft()

                if (not self.outputBuffers[route1].canSelect()):
                    clash = True
                elif( self.inputBuffer1 ):
                    self.outputBuffers[route1].select(len1)
                    self.inputBuffer1.popleft()
            else:
                routeClash = True
                

        # Pick 0,2
        if action == 4:
            if(route0 != route2) :
                if (not self.outputBuffers[route0].canSelect()):
                    clash = True
                elif( self.inputBuffer0 ):
                    self.outputBuffers[route0].select(len0)
                    self.inputBuffer0.popleft()

                if (not self.outputBuffers[route2].canSelect()):
                    clash = True
                elif( self.inputBuffer2 ):
                    self.outputBuffers[route2].select(len2)
                    self.inputBuffer2.popleft()
            else :
                routeClash = True             

        # Pick 1,2
        if action  == 5:
            if(route1 != route2) :
                if (not self.outputBuffers[route1].canSelect()):
                    clash = True
                elif( self.inputBuffer1 ):
                    self.outputBuffers[route1].select(len1)
                    self.inputBuffer1.popleft()

                if (not self.outputBuffers[route2].canSelect()):
                    clash = True
                elif( self.inputBuffer2 ):
                    self.outputBuffers[route2].select(len2)
                    self.inputBuffer2.popleft()
            else :
                routeClash = True              

        # Finish Game if all Input Buffers are Empty
        if (not done):
            allEmpty = ((len0 == 0) and (len1 == 0) and (len2 == 0))

            done = allEmpty
            
        len_choices = [0,1,2,4]

        len0state =  len_choices.index(self.inputBuffer0[0]["len"]) if (self.inputBuffer0) else 0
        len1state =  len_choices.index(self.inputBuffer1[0]["len"]) if (self.inputBuffer1) else 0
        len2state =  len_choices.index(self.inputBuffer2[0]["len"]) if (self.inputBuffer2) else 0

        route0state =  self.inputBuffer0[0]["route"] if (self.inputBuffer0) else 0
        route1state =  self.inputBuffer1[0]["route"] if (self.inputBuffer1) else 0
        route2state =  self.inputBuffer2[0]["route"] if (self.inputBuffer2) else 0

        # Pack the observation fields into a single state index
        # (the two 0, 1 fields are unused placeholder slots)
        state = to1D(len0state, 4,\
                     len1state, 4,\
                     len2state, 4,\
                     route0state, 2,\
                     route1state, 2,\
                     route2state, 2,\
                     0, 1, 
                     0, 1,  
                     self.outputBuffers[0].getState(), 16, \
                     self.outputBuffers[1].getState(), 16  \
                     )

        # Calculate Reward
        
        info = []
        total = 0

        for i in range(0, len(self.outputBuffers)):
            total = total + self.outputBuffers[i].total_transmitted

        number_within_tolerence = 0

        self.outZeroBwArray.append(self.outputBuffers[0].total_transmitted/self.cycle)
        self.outOneBwArray.append(self.outputBuffers[1].total_transmitted/self.cycle)

        self.targetArray0.append(self.targets[0])
        self.targetArray1.append(self.targets[1])

        for i in range(0, len(self.outputBuffers)):
            if ( total != 0 ):
                ideal = (self.cycle * self.targets[i])
                actual = self.outputBuffers[i].total_transmitted
                number_within_tolerence = number_within_tolerence - abs((ideal - actual)/ideal)
            else : 
                number_within_tolerence = 0

        info = {
            "out0" : self.outZeroBwArray,
            "out1" : self.outOneBwArray,
            "goal0" : self.targetArray0,
            "goal1" : self.targetArray1,
            "cycle" : self.cycle
        }

        reward = -100 if routeClash else number_within_tolerence

        for i in range(0, len(self.outputBuffers)):
            self.outputBuffers[i].step()

        return state, reward, done, info

    # Define the reset state of the environment
    def reset(self):
        self.cycle = 1
        len_choices = [1,2,4]
        route_choices = [0,1]

        self.inputBuffer0 = deque()
        self.inputBuffer1 = deque()
        self.inputBuffer2 = deque()

        self.outZeroBwArray = []
        self.outOneBwArray = []

        self.targetArray0  = []
        self.targetArray1 = []

        self.outputBuffers = [
            OutputBuffer(),
            OutputBuffer()
        ]

        # Fill Queue0
        for i in range (0, 200):
            self.inputBuffer0.appendleft(
                {
                    "len" : random.choices(len_choices,k=1)[0],
                    "route" : random.choices(route_choices,k=1)[0]
                }
            )

        # Fill Queue1
        for i in range (0, 200):
            self.inputBuffer1.appendleft(
                {
                    "len" : random.choices(len_choices,k=1)[0],
                    "route" : random.choices(route_choices,k=1)[0]
                }
            )

        # Fill Queue2
        for i in range (0, 200):
            self.inputBuffer2.appendleft(
                {
                    "len" : random.choices(len_choices,k=1)[0],
                    "route" : random.choices(route_choices,k=1)[0]
                }
            )

        state = 0
        return state

Results and Conclusions 

For 2.5 million games, a Learning Rate of 0.004, a Gamma of 0.9, an Epsilon Start of 1, an Epsilon End of 0.01, and an Epsilon decrease of 0.0000005% showed promising results. The plot shows the percent utilization by cycle, with the horizontal line as the goal. With the orange and red lines, you can see the bandwidth stays at about 50/50 with the initial random table. Then, by the last round, the table is able to maintain the goal bandwidth, as shown by the blue and green lines.

Figure 3. 20/20 Switch Results

Also, plotting the score, you can see that the learning isn't very stable. Typically, I found that the stability of the learning relates greatly to the scoring function. For example, the large spike could be the table learning not to select conflicts versus actually getting to the objective. Overall, when implementing something like this, longer isn't always better, and sometimes you do need to save a local maximum. It just so happened that here the final table performed well.

Figure 4. Score as Rounds Progress

Once the table was constructed, I wanted a quick estimate of how implementable it is. To do this, we could look at the worst case of a 17-bit truth table for each action bit; however, I used pyEDA and Espresso to see whether the table could be optimized at all. In the worst case, a 17-element condition feeds a 17-level-deep logic tree, which gives log2(17) + 17 levels, or about 21 levels of logic. This is pretty significant but doable. (This worst case assumes only two-input primitives are available.)

I ran the logic reduction inside pyEDA [4] on only the lower bit of the action (bits 0 and 1 change most often, and bit 2 slightly less). Selecting just one bit allows this optimization to run in a couple of hours, since the full truth table has 17 bits of input, meaning it is 131k rows deep. The final optimized logic table is a logic tree of 7423 terms, reduced into logic segments of 15 expressions or fewer. This gives about 17 logic levels (log2ceil(7423) + log2ceil(15)), which is better. If we could use elements with three inputs, this reduces to about 12 logic levels (log3ceil(7423) + log3ceil(15)). I expand on this more in the Future Work section, but the reduction and optimization of the table is an area that needs more work to make this a viable method for making digital circuits.
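
For reference, the kind of pyEDA call involved looks roughly like the sketch below. The helpers visited() and best_action() are placeholders standing in for reads of the learned table (not part of pyEDA), and the full 17-input run is the part that takes hours:

# Sketch of minimizing one action bit of the learned table with Espresso.
from pyeda.inter import exprvars, truthtable, espresso_tts

N_INPUTS = 17                              # 17-bit state index -> 131072 rows
X = exprvars('x', N_INPUTS)

def visited(state):
    return True                            # placeholder: read from training stats

def best_action(state):
    return state % 7                       # placeholder: argmax of the learned Q row

def output_char(state):
    if not visited(state):
        return '-'                         # unvisited row becomes a don't-care
    return '1' if (best_action(state) & 1) else '0'   # lower action bit only

rows = "".join(output_char(s) for s in range(2 ** N_INPUTS))
f = truthtable(X, rows)
fmin, = espresso_tts(f)                    # minimized sum-of-products expression
print(fmin)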

Finally, I plotted the states visited over a 10000-game window at the start, middle, and end of training, to see how the learned actions change which states the environment lands in, and to see how many of the final table's states aren't visited and thus aren't needed and could be compressed.

Figure 5. States visited for specific windows of training.

There are two main things I want to point out from the plot above:

  • There is some initial randomized state; however, during training it looks like the randomization pushes the environment into a relatively even distribution of states. This is very useful, as it lets the table see, and eventually train on, these states. In this sense it is almost good for the state space to be a bit larger, since it gives the algorithm room to learn.

  • The final state distribution looks like the table is performing actions to keep the environment in some steady state. This is probably application-specific, however, as the goal here is a steady bandwidth on each of the ports.

Overall, the logic level calculations above assumed about a third of the states were don't-cares because the table had no rewards for those actions; however, it looks like the table can be optimized further, since some states may only be visited in a transitory period. When removing these states from the truth table, there is probably a tradeoff between performance and the optimization threshold.

Improvements

If you put a programmable monitor in place of the specific beat counters I used, the switch could serve a much more general use case. In the end, this showed a solid approach to a class of problems where you know the high-level rules and you know what state is needed to enforce them, but you either fall short of representing the behavior directly or simply want to automate it: you can let a computer try to find the logic that glues everything together.

Related and Motivating Work

There is one major related work that does something very similar, which the Chief Architect at my company found after I had finished, titled "Toward More Efficient NoC Arbitration: A Deep Reinforcement Learning Approach" [1]; I recommend reading it as well. They analyze arbitration schemes at the NoC level, where each arbiter can learn how to schedule efficiently based on information such as hops and distance to the target, not just local data. They also use deep Q-Learning rather than plain Q-Learning, which means the state has to be evaluated by a Deep Neural Net; this is where my work and theirs diverge. I believe that for this application a learned table is more practical than a full DNN. They moved to deep learning because the state space is vast (which it is); however, if you keep all the information used for learning local, you don't have to scale the Q-Table to cover global knowledge like hops and distance, which may be untenable.

A similar approach is taken in "Deep Reinforcement Learning Enabled Self-Configurable Networks-on-Chip for High-Performance and Energy-Efficient Computing Systems" [2], which, like the paper above, uses deep learning for the Q function rather than a table. A common theme in these works is the use of global information to train a network, which pushes toward a DNN; a more modular approach can mitigate those scaling issues and allow a table to be used instead.

Another, more motivating work is "Learning-based Phase-aware Multi-core CPU Workload Forecasting" [3], which looks into predicting the phases of traffic a CPU might be moving into. It would be interesting if a learned arbitration scheme could also take in the workload phase the CPU is in and configure the network to perform well for that phase. To do this, you would need training data for each phase to train different Q-Agent tables; the Q-Agent would then be selected based on the workload phase the CPU is in. (You could also load entirely new tables, depending on the workload granularity you want to support.)

Future Work

Hierarchical Modules

The next step is to connect these switches and see whether a NoC scheduling scheme can be constructed in which each switch arbitrates on local information and the switches together form a NoC that satisfies higher-order rules. To do this, I would start working on a general framework that can construct systems of local Q-Learning tables to fulfill a system-level cost function.

Modules with State

While the results are promising and engaging, this example only constructs case statements that satisfy a higher-order scheduling rule. When I first set out, I wanted it to be able to build its own state machine as well; however, when I append bits so the logic can drive its own "next state" signal, it learns less efficiently. To dig into this, it would be better to work on an example with a smaller input state space, so the focus can be on learning a generalized state machine.

Table Compression

Finally, another exciting area for research is compressing a learned Q-Table so it has a smaller footprint in Hardware. The table needs to be large during training, especially if we want modules with state, to give the algorithm room to learn; however, only a small part of the table ever needs to be implemented once learning is done. For combinational logic, that could mean removing illegal input states; for modules with state, it could mean removing rows that are unreachable from the learned state machine. So while the learned table may be huge because it has to be fully representative, there should be a way to compress it for the final implementation and still get equivalent, or nearly equivalent, behavior.

For example, the figure below shows how many of the rows aren't visited. In the case of this table, approximately 33% of all rows were don't-cares.

Figure 6. Example of Rows not visited inside table.
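
A rough way to get that number from the learned table itself, assuming (as in init_Q above) that rows which were never updated still hold exactly 0.0 in every column:

# Estimate how many rows of the learned Q-Table are candidate don't-cares,
# i.e. rows whose every action value is still the 0.0 it was initialized to.
def count_untouched_rows(Q, n_states, n_actions):
    untouched = 0
    for state in range(n_states):
        if all(Q[(state, action)] == 0.0 for action in range(n_actions)):
            untouched += 1
    return untouched

# Example usage with a trained Agent:
# unused = count_untouched_rows(agent.Q, agent.n_states, agent.n_actions)
# print(unused / agent.n_states)   # fraction of rows never visited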

Final Words

Thanks for reading, and feel free to shoot me any questions.

Citations

[1] J. Yin, S. Che, M. Oskin, and G. H. Loh, "Toward More Efficient NoC Arbitration: A Deep Reinforcement Learning Approach," 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:49254411

[2] M. F. Reza, "Deep Reinforcement Learning Enabled Self-Configurable Networks-on-Chip for High-Performance and Energy-Efficient Computing Systems," in IEEE Access, vol. 10, pp. 65339-65354, 2022, doi: 10.1109/ACCESS.2022.3182500.

[3] E. S. A. Lozano and A. Gerstlauer, "Learning-Based Phase-Aware Multi-Core CPU Workload Forecasting," ACM Trans. Des. Autom. Electron. Syst., vol. 28, no. 2, pp. 23, Mar. 2023. [Online]. Available: https://doi.org/10.1145/3564929

[4] https://pyeda.readthedocs.io/en/latest/

[Honorable Mention] https://github.com/dreylago/logicmin (couldn't finish the logic optimization with it, but it has a nice interface)

Appendix : Output Buffer Code

class OutputBuffer():
    def __init__(self):
        self.len_transmitting = 0
        self.total_transmitted = 0
        self.state = 0
        self.stateMax = 16

    # Mark the port busy for `len` beats and bump the window counter
    # (3 per selected beat, saturating at 15)
    def select(self, len):
        self.len_transmitting = len

        if( self.state + (3*len) >= 15): 
            self.state = 15
        else :
            self.state = self.state + (len*3)

    # Advance one cycle: decay the window counter and, if the port is
    # still transmitting, count one more beat as transmitted
    def step(self):
        if(self.state - 1 <= 0) :
            self.state = 0
        else:
            self.state -= 1

        if (self.len_transmitting > 0) :
            self.total_transmitted = self.total_transmitted + 1
            self.len_transmitting = self.len_transmitting - 1

    # A new packet can only be selected once the current one has drained
    def canSelect(self):
        return (self.len_transmitting == 0)

    # Window counter value used as part of the Q-Table state
    def getState(self):
        return self.state

# Flatten a mixed-radix tuple of (value, max) pairs into a single table index
def to1D( x0, x0max, x1, x1max, x2, x2max, x3, x3max, x4, x4max, x5, x5max, x6, x6max, x7, x7max, x8, x8max, x9, x9max ):
    x0_total = x0
    x1_total = x1 * x0max
    x2_total = x2 * x1max * x0max
    x3_total = x3 * x2max * x1max * x0max
    x4_total = x4 * x3max * x2max * x1max * x0max
    x5_total = x5 * x4max * x3max * x2max * x1max * x0max
    x6_total = x6 * x5max * x4max * x3max * x2max * x1max * x0max
    x7_total = x7 * x6max * x5max * x4max * x3max * x2max * x1max * x0max
    x8_total = x8 * x7max * x6max * x5max * x4max * x3max * x2max * x1max * x0max
    x9_total = x9 * x8max * x7max * x6max * x5max * x4max * x3max * x2max * x1max * x0max
    return x0_total + x1_total + x2_total + x3_total + x4_total + x5_total + x6_total + x7_total + x8_total + x9_total